One technique that can be used to improve the performance of a database
management system is known as attribute partitioning. This is the process of dividing the
attributes of a file into subfiles that are stored separately. By storing together those attributes
that are frequently requested together by transactions,
attribute partitioning can reduce the number of pages that must be transferred from
secondary storage to primary memory in order to process a transaction.

The goal of this work is to design mechanisms that can automatically select a
near-optimal attribute partition of a file's attributes, based on the usage pattern of the file
and on the characteristics of the data in the file. The approach taken to this problem is
based on the use of a file design cost estimator and of heuristics to guide a search through
the large space of possible partitions. The heuristics propose a small set of promising
partitions to submit for detailed analysis. The estimator assigns a figure of merit to any
proposed partition that reflects the cost that would be incurred in processing the transactions
in the usage pattern if the file were partitioned in the proposed way. We have also
conducted an extensive series of experiments with a variety of design heuristics; as a result,
we have identified a heuristic that nearly always finds the optimal partition of a file.

The context of this study is a relational database management system that can
process transactions made against relations whose physical partitioning is unknown to the
user. In specifying and modeling this system, it is necessary to address the problem of
optimizing nonprocedural queries made to a partitioned file. We have derived a number of
such optimizations and report the results of a number of experiments with
them.
CHAPTER 1


INTRODUCTION


The work reported here is part of an ongoing research effort to
develop a self-adaptive database management system. The intent of this development is
twofold: to develop the techniques and methodology for the construction of such systems, and
to identify database physical design issues with techniques for their automatic and optimal
determination. In this report, we address the problem of optimizing the performance of a
self-adaptive database management system, in a dynamic environment where access
requirements are continually changing, by automatically partitioning the attributes (fields) of
the file. Attribute partitioning is the task of dividing the attributes (fields) of a file into
non-overlapping groups and then storing each group in a separate physical file.


1. Self-adaptive


It is important that a database system perform efficiently at all times. Efficient
performance requires that the physical organization of the database match the usage pattern
of its users. Thus, as the database's usage pattern changes over time, its organization and its
access structures can become obsolete, with consequent degradation of performance.
Performance degradation may also result as the database grows in size or as the nature of the
data it contains changes. After some time, the performance of the database system may
deteriorate sufficiently so as to compel a database reorganization. Since the applications
programs accessing the database are continually being altered, with new applications
programs replacing old ones, and since the contents of the database continually undergo
change, the reorganization of the database's physical structure must be an ongoing process.

Conventionally, the database administrator determines when and how to reorganize
a database. His decision is based on limited information about the database and the
transactions performed on it; consequently, the decision is largely an intuitive guess. For large
databases, a more systematic means of acquiring information about database usage and a
more algorithmic way of evaluating the costs of alternative configurations are essential. It has
recently been proposed that database management systems be self-adaptive, and automatically
reorganize themselves as the need arises [35, 12]. Hammer describes a methodology for
monitoring database usage patterns, and describes the general principles for a self-adaptive,
self-reorganizing database management system.

A minimal capability of a self-adaptive database management system should be the
incorporation of a monitoring mechanism that collects usage statistics while performing
transaction processing. A database management system is well suited for gathering and
analyzing information on its own usage and performance; and if the gathering and analysis
of the usage and performance information is done appropriately, the associated overhead can
be minimal. In addition, a self-adaptive database management system should be able to come
up with desirable physical organizations (i.e. desirable data structures and access structures)
based upon the collected statistics, and be able to evaluate the cost of each alternative
organization in order to select an optimal physical organization for a database. Also, it is
possible that such a system might perform the necessary database reorganization itself, after it
has evaluated the cost/benefits of reorganization and the associated costs of retranslating the
applications programs that access the database.


2. The Relational Model of Data


In a self-adaptive database management system, the physical organization of the
database is perpetually being reorganized. In order for the database reorganization to be
truly effective, a database management system that performs self-reorganization will have to
manifest the following two important characteristics: 1- data independence between the
database's physical organization and the application programs that access the database, and
2- nonprocedural access to the contents of the database. By data independence we mean that
users and their application programs are not required to know the actual physical
organization used to represent the data, so that they are free to concentrate on a logical view
of the data. Data independence makes the database easy to use and avoids the need for
application program retranslation every time the database's physical structure is changed.
Nonprocedural access also makes the database easy to use; this entails the provision of access
languages which allow the specification of desired data in terms of properties it possesses
rather than in terms of the search algorithm used to locate and retrieve it.

The relational model of data (Codd [9]) has been proposed as a means of achieving
the above goals of data independence and nonprocedural access. The relational data model
provides a simple and uniform logical view of the data that is completely independent of the
actual storage structures and access structures used to represent and access the data. This
makes the definition and manipulation of a database independent of its underlying physical
organization. As a result, changes at the data organization level need not be reflected in the
application programs that access the database.

A relation in the relational data model is a named two dimensional table that has a
fixed number of attributes (columns) and an arbitrary number of unordered tuples (rows).
All the rows of the table have to be unique. A tuple representing an entry in a relation
contains a value for each attribute of the relation. The number of attributes in a relation is
m and the number of tuples in the relation is n. Figure 1 shows the relation
ENROLLMENT for students enrolled in courses. The ENROLLMENT relation has two
attributes, Student and Course, and four tuples (Doe, 6.035), (Poe, 6.032), (Doe, 6.851), and
(Roe, 6.035). The physical realization of a relation is often called a file, with the attributes
and the tuples of the relation called the fields and records of the file respectively. Henceforth,
we will use the term file when discussing the totality of the data in a relation, indicating that
we are dealing with the physical representation of the relation. However, we will continue
using the terms attribute and tuple for the fields and records of a file so that the two
dimensional tabular data format of the relation will be kept in mind.


ENROLLMENT:

                 Attribute

             Student    Course
           +---------+---------+
           |   Doe   |  6.035  |
           |   Poe   |  6.032  |
    Tuple  |   Doe   |  6.851  |
           |   Roe   |  6.035  |
           +---------+---------+

Figure 1 The ENROLLMENT relation with 2 attributes and 4 tuples.

In a relational database management system, the user's view of the database is
independent of the details of the database's physical organization. Furthermore, his
nonprocedural queries are far removed from the primitive data manipulation operations for
locating and retrieving the data. Consequently, more responsibility is placed on a relational
database management system, in two forms: 1- choosing an efficient physical organization
for the relation, and 2- optimizing the
process of finding answers to queries made to the database, by means of efficient and
judicious use of the available access structures.

We believe that the selection of a good physical organization is the primary issue in
relational database implementation, since the efficiency that can be achieved by a query
optimizer is strictly delimited by the available access structures. Furthermore, the efficient
utilization of a database is highly dependent on the optimal matching of its physical
organization to its access requirements, as well as to the other database characteristics (such as
the distribution of attribute values in it). Hence, the usage pattern of a database should be
ascertained and utilized in choosing the physical organization.

There are numerous possibilities for the physical organization of a relational
database. The selection of a particular physical organization must be based on minimizing
the performance cost in terms of both data access cost and data storage cost. The subject of
this research is selecting the optimal attribute partition of a relational database by utilizing
the access history of the database in order to minimize the data access cost. Attribute
partitioning consists of dividing the attributes of a file into subfiles that are stored separately.
In relational terms, this means splitting a relation into a number of subrelations, each
containing a subset of the attributes of the original relation, such that the original relation
may be uniquely reconstructed from the collection of the subrelations. (Strictly speaking, a
subfile is not a relation in that duplicate tuples are allowed. We will give a formal definition
of a subfile in the next section.)


Let $A$ be the set of attributes of a relation, and $T$ be the set of tuple identifiers
of the relation. (A tuple identifier is a unique identifier for a tuple in the relation.) The
number of attributes in the relation is $|A| = m$, and the number of tuples in the relation is
$|T| = n$. Consider the collection of subfiles $F = \{F_i\}_{i=1}^{M}$, where each subfile $F_i$ is defined by a
pair consisting of an attribute set and a tuple identifier set, which specify the attributes and
tuple identifiers that are represented in the subfile $F_i$: $F_i = (A_i, T_i)$, $A_i \subseteq A$, $T_i \subseteq T$. The
collection of subfiles $F$ is called the cluster of the relation, and can have two basic forms:

1- an attribute cluster in which

     $T_i = T$, $i = 1, \ldots, M$ and
     $\bigcup_{i=1}^{M} A_i = A$;

2- a tuple cluster in which

     $A_i = A$, $i = 1, \ldots, M$ and
     $\bigcup_{i=1}^{M} T_i = T$.


The tuples of a subfile are called subtuples. A subfile $F_i$ of an attribute cluster
contains n subtuples: one for each tuple in the original relation. A subtuple of a subfile is
that part of the original file's tuple that has attributes $A_i$. The subtuples of subfile $F_i$ in an
attribute cluster need not all be different. For example, if the relation of Figure 1 is clustered
such that $F_1 = (A_1, T)$ is a subfile with $A_1 = \{\text{Course}\}$, then the subtuples of $F_1$ will be
(6.035), (6.032), (6.851), and (6.035).
An attribute cluster $\{F_i\}_{i=1}^{M}$ in which $A_i \cap A_j = \emptyset$ for all $i \neq j$ is termed an attribute
partition of the relation. In this report, we will limit our attention to the topic of attribute
partitioning. (A discussion of tuple partitioning appears in Section 7.2.) Attribute
partitioning is the task of dividing the attributes of a relation and storing each disjoint subset
of attributes in a separate subfile. The objective of attribute partitioning is to construct an
attribute partition of a relation that optimizes the performance of the database management
system by minimizing the cost of locating and retrieving data. Intuitively, attribute
partitioning is accomplished by assigning attributes to the same subfile whenever they are
consistently requested together by queries.
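
The definitions above can be made concrete with a small sketch. It encodes the ENROLLMENT relation of Figure 1, checks whether a collection of attribute sets forms an attribute cluster or an attribute partition, and projects the relation onto one attribute set to obtain its subtuples. The tuple identifiers 1 to 4 and all function names are assumptions introduced for illustration; they do not come from the report.

```python
from itertools import combinations

# ENROLLMENT relation of Figure 1, keyed by an assumed tuple identifier.
ATTRIBUTES = ("Student", "Course")
ENROLLMENT = {
    1: {"Student": "Doe", "Course": "6.035"},
    2: {"Student": "Poe", "Course": "6.032"},
    3: {"Student": "Doe", "Course": "6.851"},
    4: {"Student": "Roe", "Course": "6.035"},
}


def is_attribute_cluster(attribute_sets, all_attributes):
    """True if every A_i is a subset of A and the A_i together cover A."""
    universe = set(all_attributes)
    return (all(set(a) <= universe for a in attribute_sets)
            and set().union(*attribute_sets) == universe)


def is_attribute_partition(attribute_sets, all_attributes):
    """True if the sets form an attribute cluster and are pairwise disjoint."""
    return (is_attribute_cluster(attribute_sets, all_attributes)
            and all(not (set(a) & set(b))
                    for a, b in combinations(attribute_sets, 2)))


def subtuples(relation, attribute_set):
    """Project every tuple onto attribute_set; duplicate subtuples are kept."""
    ordered = [a for a in ATTRIBUTES if a in attribute_set]
    return [tuple(relation[tid][a] for a in ordered)
            for tid in sorted(relation)]
```

Calling `subtuples(ENROLLMENT, {"Course"})` returns one subtuple per original tuple, including the repeated course number, which is why a subfile is not, strictly speaking, a relation.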
In conventional database management systems (with paged memory organization),
each tuple of a relation is stored with all its attributes together in one file. When a query is
made to the database, all tuples that are required by the query are brought into primary
memory by retrieving all the pages that the tuples reside on. It has been observed in
practical database applications that a query does not usually request all the attributes of the
file; most queries request only a few of the attributes. Problems are presented by the
co-existence in the same file (or equivalently in the same tuple) of attributes that are not
requested by the query together with the few attributes that are. Whenever the requested
attributes are retrieved, the non-requested attributes will also have to be brought into
primary memory. If only a single tuple is needed to answer such a query, then it really does
not matter that other attributes that are not requested happen to reside in the same tuple with
the requested attributes; in any event, only one page needs to be retrieved from secondary
storage. On the other hand, usually more than one tuple must be retrieved in order to
answer a query. Because the non-requested attributes make each tuple larger than what it must
minimally be, fewer tuples fit on a page; there is then a higher probability that the required
tuples reside on different pages, consequently causing excess
page accesses when answering queries that request only a few of the attributes while accessing
more than one tuple. Therefore, if a file is partitioned such that attributes that are
consistently requested together by queries are put together into the same subfile and
separated from those attributes with which they are not requested, then the number of page
accesses required to retrieve these attributes will be reduced over the number required from a
file that is not so partitioned.

On the other hand, indiscriminately separating attributes and storing each in a
separate subfile will also result in excess page accesses. This is because a query that requests
the attributes of a subfile together with some other attributes that are not in the subfile will
incur more page accesses than when all these attributes are in the same subfile, since now the
two groups of attributes reside in different subfiles and on separate pages. When a file has
been partitioned into subfiles, queries requesting attributes residing together in one subfile will
become less costly to answer, while those queries that have their requested attributes
distributed over more than one subfile will become costlier to answer. Intuitively, the optimal
partition is the partition that maximizes the cost reduction for the first kind of queries while
minimizing the cost increase for the second kind of queries.
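
The tradeoff described above can be illustrated with a rough sketch. It is not the cost model of this report (which is developed in Chapter 4); it assumes that a query needs k tuples chosen independently and uniformly at random, that subtuples never span pages, and that every subfile holding a requested attribute is accessed for each needed tuple. Under those assumptions the expected number of distinct pages touched in a subfile of P pages is P(1 - (1 - 1/P)^k). The attribute names, widths, file size, and page size are hypothetical.

```python
from math import ceil


def pages_in_subfile(n_tuples, subtuple_words, page_words):
    """Pages needed to store n_tuples subtuples that never span pages."""
    per_page = page_words // subtuple_words
    return ceil(n_tuples / per_page)


def expected_pages_touched(n_pages, k):
    """Expected distinct pages hit by k independent uniform tuple accesses."""
    return n_pages * (1.0 - (1.0 - 1.0 / n_pages) ** k)


def query_cost(query_attrs, partition, widths, n_tuples, page_words, k):
    """Sum the expected page accesses over every subfile the query touches."""
    cost = 0.0
    for subfile in partition:
        if subfile & query_attrs:
            width = sum(widths[a] for a in subfile)
            n_pages = pages_in_subfile(n_tuples, width, page_words)
            cost += expected_pages_touched(n_pages, k)
    return cost


# Hypothetical file: attribute widths in words, 10,000 tuples, 256-word pages.
WIDTHS = {"a": 4, "b": 4, "c": 28, "d": 28}
N_TUPLES, PAGE_WORDS, K = 10_000, 256, 100

UNPARTITIONED = [frozenset("abcd")]
TWO_SUBFILES = [frozenset("ab"), frozenset("cd")]
ONE_PER_ATTRIBUTE = [frozenset(x) for x in "abcd"]


def cost(query, partition):
    return query_cost(set(query), partition, WIDTHS, N_TUPLES, PAGE_WORDS, K)
```

With these numbers, a query on attributes a and b is cheapest under `TWO_SUBFILES` and most expensive under `ONE_PER_ATTRIBUTE`, while a query on a and c is cheaper on the unpartitioned file than under `TWO_SUBFILES`: grouping helps the queries whose attributes stay together and hurts the ones whose attributes are split.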

Attribute partitioning is most useful for large databases where queries made to the
database usually request only a few attributes of each tuple. It is conceivable that the request
distribution that has been observed for tuples requested by queries also applies to attributes
requested by queries. It has been observed in many practical database applications that not
all the tuples of a database are requested with the same frequency. The "80-20" rule of
thumb for the distribution of tuple request frequencies (Heising [17]) states that approximately
80 percent of queries request the 20 percent most frequently requested tuples in a file.
Furthermore, the rule also applies to the 20 percent most frequently requested tuples in the
file; i.e. 64 percent of the queries request 4 percent of the most frequently requested
tuples, and so forth. If this is also true for the request frequencies of attributes by queries,
then most requests are only for a few active attributes of the file.
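
The arithmetic behind the recursive application of the rule is simple enough to spell out. The sketch below applies the 80-20 split repeatedly; the function name and the idea of a "level" counter are illustrative assumptions, not terminology from the report.

```python
def request_share(level):
    """Shares after applying the 80-20 rule `level` times.

    Returns (fraction of queries, fraction of most requested tuples).
    """
    return 0.8 ** level, 0.2 ** level


for level in (1, 2, 3):
    queries, tuples = request_share(level)
    print(f"level {level}: {queries:.1%} of queries request "
          f"the top {tuples:.1%} of tuples")
```

Level 2 reproduces the figures quoted above: 0.8 x 0.8 = 0.64 of the queries go to 0.2 x 0.2 = 0.04 of the tuples.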

An example of a large database, where attribute partitioning may be useful, is the
Navy Command and Control Data Base (0). This database consists of a few relations with
many tuples per relation. Some of these relations have as many as 35 attributes and a tuple
length of 64 words. Queries are on-line, and predominantly involve only a few attributes.
Some attributes of the file, like the name of ships or the class of ships, are frequently requested
by queries, while other attributes, like the diameter of the torpedo tube, are seldom requested.
Therefore, partitioning the attributes of the files may result in considerable savings in page
access requirements.
4. Thesis Objective

To summarize, the principal goal of this report is to develop techniques for attribute
partitioning in a self-adaptive relational database environment. For this purpose, we have
assumed a database management system that supports partitioned files and we have built an
attribute partitioning system consisting of a model for the assumed database management
system and a set of attribute partitioning heuristics. The attribute partitioning heuristics
select a partition for a database managed by a database management system similar to the
one we have modelled. Although our model is not that of any existing system, it is
representative of practical systems. Our thrust in building this model has been to avoid
many of the simplifying assumptions made in previous studies, and thereby emphasize
important aspects of realistic database environments. We stress the need for monitoring the
database management system and acquiring parameters on the database usage pattern and on
the evolving characteristics of the database itself. We describe a methodology for processing
transactions made to a partitioned database built with various access structures, and develop
a complete and accurate model of the cost of accessing the subfiles when performing such a
transaction. Finally, we concern ourselves with heuristic techniques that utilize the acquired
parameters and produce optimal or near optimal attribute partitions for the database at a
reasonable computational cost.
5. Thesis Organization


The rest of this report is organized as follows. Chapter 2 summarizes a number of
previous studies in the area of attribute partitioning, and in the context of evaluating them,
argues for the need of a heuristic solution to realistic database attribute partitioning
problems. In Chapter 3 we provide the model of the underlying database management
system that we have considered: the physical storage structure, the access structures, the
transaction model, the method of processing queries in the partitioned environment, and
techniques for the acquisition of the parameters needed for our cost analysis. In Chapter 4
we present the cost analysis for various basic operations on a partitioned database and
describe how to compute the database's performance cost, which is what we wish to minimize.
Then in Chapter 5 we present a number of attribute partitioning heuristics that we have
devised, along with the motivation for their consideration. We also discuss the comparative
advantages and disadvantages of each heuristic, and outline how each heuristic has
performed in a series of experiments. Chapter 6 poses an attribute partitioning problem for a
relation with 8 attributes, and describes its solution using the heuristics of the preceding
chapter. Finally, Chapter 7 concludes the report with suggestions on how to extend the
underlying environment in order to solve more realistic attribute partitioning problems, and
also discusses the relationship between database attribute partitioning and other physical
database design issues.
THE APPROACH TO DATABASE ATTRIBUTE PARTITIONING


The purpose of this chapter is to discuss the approach we have taken to solving
the attribute partitioning problem, and to contrast it with the approach taken by others in
determining the optimal attribute partition. There are two major approaches to attribute
partitioning, each approach having its own merits and limitations. The two approaches are:
1- the integer programming approach, which is the approach taken by most other
researchers, and 2- the heuristic approach. We have chosen the heuristic approach for the
following reasons: 1- More realistic database environments can be handled by the heuristic
approach than by the integer programming approach, 2- An optimal or near optimal
attribute partition can be found much more efficiently by the heuristic approach than by the
integer programming approach, and 3- Although the heuristic approach (unlike the integer
programming approach) does not guarantee that the optimal partition will eventually be
found, the heuristics we have employed have found an optimal or near optimal
partition for the attribute partitioning problem.


The idea of clustering attributes (and hence attribute partitioning) as a means of
improving the performance of a database management system has often appeared in the
literature on file design and optimization. Until the paper of Kennedy [21], there had been
little systematic study of this aspect of file organization. Further, the conversion of a relation
into second and third normal forms [10] was sometimes confused with attribute clustering.
Although normalizing a relation into its normal forms may result in the clustering of
attributes, and thereby reduce page accesses, normalization is directed towards improving the
logical data schema rather than enhancing database system performance. It is the functional
dependencies among the attributes that govern the splitting of relations in the process of
normalization, rather than the data's physical characteristics or the database usage pattern.
An example of work in the area of relational database normalization is that of Delobel and
Casey [13]. They are concerned with the problem of decomposing a relation schema into a set of
subrelations such that the information content and logical relationships of the original
relation schema are preserved. However, they do not consider physical database criteria that
would result in a physically optimal decomposition of the relation schema.

Implementations of database management systems that support partitioned files have
been few, and have been limited to simplified environments where finding an optimal or a
suitable partition is relatively easy to manage. Moreover, in these implementations attribute
partitioning has been treated only as a one-shot affair, to be determined at the initial creation
of a file. Attribute partitioning has not been viewed as a database organization issue that
needs to be reconsidered periodically, where the partitioning should be done by a self-adaptive
database management system.

There have been a number of previous studies of attribute partitioning and attribute
clustering (Day [11], Seppälä [32], Osman [29], Yue and Wang [39], Benner [4], Alsberg [1],
Babad [2], Stocker and Dearnley [35, 12], Kennedy [21, 20], Eisner and Severance [14], March
and Severance [23], and Hoffer and Severance [18, 19]). However, we feel that the results of
these studies are not directly applicable to a complete or realistic database environment.
Some of these have been formal analyses which have made many simplifying assumptions in
order to obtain analytic solutions; others have been designs that are incomplete or unrealistic
in many ways. Our thrust has been to avoid many of the simplifying assumptions that have
been made in previous studies, and thus to develop more complete and accurate models of cost
and database access. In addition, we stress the importance of database cost analysis and the
acquisition of accurate parameters, in an environment where access requirements are
continually changing. This aspect of the attribute partitioning problem has not been
addressed in previous work. Below, we present a summary of some of the earlier efforts in
attribute partitioning, outlining the assumed environment of each project along with the
thrust of its research.

Two of the earliest papers on attribute clustering in a self-adaptive database
management system are by Stocker and Dearnley [35, 12]. They discuss the implementation of
a self-reorganizing database management system that carries out attribute clustering. (Recall
that in an attribute cluster, an attribute may exist redundantly in several subfiles.) Stocker
and Dearnley show that in a database management system where storage cost is low
compared to the cost of accessing the subfiles, it is beneficial to cluster the attributes, since the
increase in storage cost will be more than offset by the saving in access cost. Although they
do not outline the attribute clustering technique they use, they discuss a query processor,
which utilizes graph theory and various heuristics to process queries made to a file clustered
by its attributes. They conclude that attribute clustering in existing database management
systems is both viable and desirable.

Kennedy [21, 20] considers a mathematical model of attribute partitioning where each
attribute $a_i$ is of known length, and has probability $p_i$ of being requested by a query. The
joint probability that attributes $a_i$ and $a_j$ are requested by the same query is assumed to be
$p_i p_j$, i.e., attributes are assumed to appear in queries independently of one another. A cost
function based upon this assumption is derived, which reflects the expected amount of data
that must be transmitted (in terms of number of words, from secondary storage to primary
memory) in order to answer a query. The objective here is to choose a partition such that
this cost function is minimized. Kennedy's model is a mathematical formulation of a
simplified attribute partitioning problem in terms of zero-one integer programming where the
only parameters are $p_i$ and $l_i$, the length of attribute $a_i$. In addition to many other
simplifications, Kennedy's model assumes that when an attribute is requested by a query, the
subfile containing that attribute has to be retrieved and scanned in its entirety (rather than
retrieving just those subtuples of the subfile that are really needed to answer the query).
Since in this model optimality can always be trivially attained when each subfile contains
exactly one attribute, the number of subfiles $M$ over which the attributes are to be
distributed has to be fixed beforehand. (Otherwise the trivial partition, defined as the
partition where each attribute is in a separate subfile of its own, will always prevail.) As
Kennedy notes, there is no way short of exhaustive enumeration (which is infeasible, as shown
in Chapter 5) to find the optimal solution even for this rather simple model. To find the
optimal solution to the partitioning problem posed by his model, he introduces two further
simplifying assumptions in order to reduce the integer programming problem to a
problem where mathematical optimization techniques can be applied. One simplification is
the assumption that all attribute request probabilities are equal, and the other simplification
is that all attributes are of equal lengths.
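
The behaviour attributed to Kennedy's model above can be checked on a toy instance. The sketch below is one plausible reading of the cost function as described in this chapter, not Kennedy's own formulation: a subfile is transferred whole whenever at least one of its attributes is requested, attributes are requested independently with probability p_i, and the cost is the expected number of words transferred per tuple. The attribute names, probabilities, and lengths are hypothetical, and the search is plain exhaustive enumeration, which is only practical for a handful of attributes.

```python
from math import prod


def expected_words(partition, p, length):
    """Expected words transferred per tuple when touched subfiles are read whole."""
    total = 0.0
    for subfile in partition:
        # Probability that at least one attribute of the subfile is requested.
        p_touched = 1.0 - prod(1.0 - p[a] for a in subfile)
        total += p_touched * sum(length[a] for a in subfile)
    return total


def partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + smaller


def best_partition(p, length, n_subfiles=None):
    """Exhaustively find the cheapest partition, optionally with a fixed M."""
    attrs = sorted(p)
    candidates = (c for c in partitions(attrs)
                  if n_subfiles is None or len(c) == n_subfiles)
    return min(candidates, key=lambda c: expected_words(c, p, length))


# Hypothetical request probabilities and attribute lengths (in words).
P = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
LENGTH = {"a": 2, "b": 3, "c": 10, "d": 20}
```

With no constraint on the number of subfiles, `best_partition(P, LENGTH)` returns the trivial partition with one attribute per subfile, as the text predicts. Fixing `n_subfiles=2` yields a non-trivial answer that groups the two frequently requested attributes together. The number of candidate partitions grows very quickly with the number of attributes (15 for four attributes here), which is why exhaustive enumeration does not scale.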

In the work of Eisner and Severance [14], a file is partitioned into two subfiles:
a primary and a secondary subfile. Each subfile is located on a separate storage device
characterized by differing storage cost and retrieval speed. The attributes are assigned to
each of the subfiles without redundancy. Two forms of the objective cost function that is to
be minimized are derived, where the first cost function is a special case of the (second) more
general cost function. The first cost function is the sum of storage charges for subtuples in
the primary subfile, plus the cost of accessing all the subtuples residing in the secondary
subfile. (The secondary subfile is accessed only when a query requests an attribute which
happens to be residing there.) This cost function is linear and can be solved by existing
integer programming techniques. The cost of finding the optimal partition for this function
by integer programming is of the order $(m + |Q|)^3$, where $m$ is the number of attributes in
the file and $Q$ is the set of queries made to the database. The second objective cost function
is nonlinear, and measures the total costs of access, transfer, and storage for subtuples in both
the primary and secondary subfiles. The search cost for finding the optimal solution for the
general nonlinear objective cost function is even higher than that for the simpler linear cost
function. The limitations of the model developed by Eisner and Severance are apparent: only
a maximum of two subfiles is allowed, and the cost associated with processing a query is
taken to be the cost of accessing the whole (primary or secondary) subfile in its entirety rather
than the cost of retrieving just those subtuples of the subfile that are really needed to answer
the query. Furthermore, the cost of finding the optimal partition using the linear objective
cost function grows as the cube of the sum of the number of attributes and the number of
queries (and the cost grows even faster for the nonlinear cost function), which is
too large for practical purposes.

March and Severance [23] extend the model of Eisner and Severance to some extent
by assuming that subtuples are blocked in each subfile into fixed size pages. (The page sizes
in the primary and secondary subfiles are not necessarily the same, but the constraint is
imposed that the sum of the primary subfile page size and the secondary subfile page size is
constant.) The nonlinear objective cost function they derive depends not only on how the
attributes are partitioned among the two subfiles, but also on the page sizes selected for each
of the primary and secondary subfiles. Besides the rather peculiar paging organization
adopted, the model of March and Severance has the additional disadvantage that it does not
contain an accurate model of the cost of accessing subtuples that are selected in queries.
Rather, the primary and secondary subfiles are assumed to be accessed in their entirety
whenever any of their attributes are requested by a query (as in the model of Eisner and
Severance). Using integer programming techniques, March and Severance obtain the optimal
partition for their model. However, compared to the model of Eisner and Severance, the cost
of solving the integer programming problem grows even faster as the number of
attributes and the number of queries made to the database grow.

Hoffer [18] develops an extensive model for attribute partitioning, in which the objective cost function is a linear combination of storage, retrieval, update, and insert/delete costs. The problem is formulated as a nonlinear zero-one integer programming problem, and is solved by a branch and bound algorithm. In applying the optimization algorithm to the formulation, it became obvious that problems of even modest size were computationally intractable. In order to use this model to obtain solutions to realistic problems, it became necessary to reduce the size of the feasible solution space to a point where optimization becomes economically feasible. To this purpose, Hoffer and Severance [19] propose an attribute partitioning method that produces an initial and crude, but nevertheless reasonable, partition of the attributes. This partition is then passed as a starting point to the branch and bound algorithm of Hoffer [18]. The initial partitioning method of Hoffer and Severance uses the cluster analysis algorithm of McCormick et al., which is heuristic in nature, to group the attributes together into blocks. The clustering algorithm takes a set of objects and utilizes a measure of "similarity" for all pairs of the objects. It then rearranges the set of objects such that pairs of objects with a large similarity measure fall adjacent or nearly adjacent to one another. Hence clusters (or blocks) of objects can be identified such that every pair of objects within the cluster carries a large measure of similarity, and every pair of objects across cluster boundaries carries a small measure of similarity. Hoffer and Severance provide attributes as objects to the clustering algorithm. They also develop a similarity measure for any pair of attributes (called the pairwise attribute access similarity measure), which expresses the degree to which the pair of attributes is used together in queries. The similarity measure of a pair of attributes is obtained as follows: A subfile consisting of only the two attributes is assumed. When answering a query that requests one or both of the attributes, the subtuples of the subfile need to be retrieved. However, not all of the information retrieved is used for answering the query: some of the subtuples may not satisfy one of the attributes, and hence the information contained for the other attribute in this subtuple is of no use. The similarity measure for the pair of attributes for this query is defined as the ratio of the amount of useful data transferred to the total amount of data transferred from such a subfile. The access similarity measure is derived by considering the set of queries, the frequency of each individual query, and the fraction of tuples satisfying


each query.

The queries that Hoffer and Severance consider can contain only one attribute in their selection component. This assumption restricts the applicability of their techniques. Also, the only access path that they consider is sequential searching, and therefore the subfile that contains the attribute of the selection component of the query is searched in its entirety (however, only tuples required for projection are selectively retrieved from the other subfiles). As with the model of Kennedy, the criterion by which a partition is selected for the file is the fraction of useful data transferred from secondary storage to primary memory. Since with such a criterion optimality can always be attained with the trivial partition, the number of subfiles in the partition found by the clustering algorithm has to be specified in advance. There are also problems associated with the clustering algorithm that they use. Later in this report, we describe some of these problems.


The two approaches to attribute partitioning that have been taken are the integer programming approach and the heuristic approach. Most earlier work on attribute partitioning falls in the former category. There are two major problems associated with the formulation and solution of the attribute partitioning problem in the integer programming approach: 1- The applicability of this approach is limited because of the unduly simplifying assumptions made on the problem environment in order to obtain an objective cost function that is amenable to optimization. In a realistic database environment where the file has many attributes, many queries are made to the database, and there are several access paths available by which to access the data, the number of variables and constraints to consider is so large that it effectively precludes an integer programming formulation of the attribute partitioning problem for such an environment. 2- Even when many restrictions are assumed in the database environment, the attribute partitioning problem usually reduces to solving a zero-one nonlinear integer programming problem to which no available mathematical programming technique can be applied. As Kennedy notes, no technique has been found (short of enumeration) to solve the attribute partitioning problem that is expressed only in terms of attribute request probabilities and attribute lengths. For cases where mathematical programming techniques are available for solving the integer programming formulation, applying them to even modestly sized problems is computationally infeasible.
The simplifying assumptions on the problem environment that have been made by previous studies fall in the main in two categories. One is a limitation on the complexity of the queries that are made to the database. Queries are either assumed to consist of single-attribute requests, or the database usage is described by a set of attribute access probabilities, each giving the probability of an attribute being requested by a query. Correlations between attribute occurrences in queries are ignored. The other simplification usually adopted concerns the computation of the cost of answering queries in terms of the amount of information that must be transferred. In this regard, the effect of blocking tuples (and subtuples) into pages is completely ignored. The blocking of tuples into pages has the effect of increasing the amount of information transferred per tuple access. However, this increase is not linear, since accessing any number of tuples that reside on the same page will result in only one page access. If these blocking effects are ignored, then the partitioning problem has a trivial solution. Kennedy [Lemma 4, 20] has shown that when the amount of information transferred is the sole criterion of the cost function, the optimal attribute partition is the trivial attribute partition, as described in Chapter 1. The reason for this is that the total access cost is non-increasing as the attributes are dispersed into an increasing number of subfiles, even if the attributes are inappropriately partitioned. Hence in studies where blocking effects are ignored, in order that the trivial partition not prevail, the number of subfiles into which the attributes are to be partitioned has to be artificially limited and prespecified.
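
The nonlinearity introduced by blocking can be illustrated with a small sketch. It is not part of the model described in this report: the function name `pages_touched`, the zero-based tuple numbering, and the sample numbers are assumptions made only for illustration. The sketch counts the distinct pages that hold a set of requested tuples, so that tuples sharing a page cost a single page access.

```python
def pages_touched(tuple_numbers, blocking_factor):
    """Count the distinct pages holding the requested tuples.

    Assumes tuples are numbered from 0 and stored contiguously, with
    `blocking_factor` tuples per page (hypothetical layout for illustration).
    """
    return len({n // blocking_factor for n in tuple_numbers})


# Six requested tuples: four on one page, two on another.
wanted = [0, 1, 2, 3, 40, 41]
clustered = pages_touched(wanted, blocking_factor=4)

# The same number of tuples spread one per page costs one access each.
scattered = pages_touched([0, 4, 8, 12, 16, 20], blocking_factor=4)
```

Under these assumptions the first request touches two pages while the second touches six, although both retrieve six tuples. A cost function based only on the amount of data transferred would not distinguish the two cases.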


The approach to attribute partitioning that we have taken in our work is heuristic in nature. In the heuristic approach, an optimal or near optimal partition is found for the attributes by a process of stepwise minimization. An attribute partitioning heuristic which is based upon stepwise minimization starts with a given partition (e.g. the trivial partition), and attempts to derive from it a new partition that is incrementally superior to the original one, in that the database partitioned according to the new partition will have a lower performance cost. When this is achieved, the heuristic further tries to improve upon the newly derived partition. Each time an improved partition is derived, the performance cost of the database is reduced. The stepwise minimization process is continued until no improvement can be made to the latest partition. This last partition will then be returned as the result of the attribute partitioning heuristic. The resultant partition is not necessarily optimal, although it can often be argued that the partition is near optimal. (The near optimality of the partition proposed by the heuristic can be verified by comparing the performance of the database management system when the file is optimally partitioned with the performance when the file is partitioned as suggested by the heuristic.) Indeed, in the course of our experimentation with our attribute partitioning heuristics, we have consistently found that the resultant partition of the heuristics is either optimal or differs insignificantly from the optimal partition.
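
The stepwise minimization process described above can be sketched as a simple descent loop. This is a minimal illustration and not the heuristic developed later in this report: the move set (merging two blocks), the function names, and the toy cost function are all assumptions introduced here. The real cost would come from the file cost estimator.

```python
from itertools import combinations


def merge_neighbors(partition):
    """Yield every partition obtained by merging two blocks (an assumed move set)."""
    blocks = list(partition)
    for a, b in combinations(blocks, 2):
        rest = [blk for blk in blocks if blk not in (a, b)]
        yield frozenset(rest + [a | b])


def stepwise_minimize(start, cost, neighbors=merge_neighbors):
    """Repeatedly move to the cheapest neighboring partition until none is better."""
    current, current_cost = start, cost(start)
    while True:
        best = min(neighbors(current), key=cost, default=None)
        if best is None or cost(best) >= current_cost:
            return current, current_cost
        current, current_cost = best, cost(best)


# Hypothetical usage pattern: (requested attributes, frequency).
QUERIES = [(frozenset("ab"), 10), (frozenset("cd"), 10)]


def toy_cost(partition):
    """Toy stand-in for the cost estimator: each touched block costs 1 + its width."""
    return sum(freq * sum(1 + len(blk) for blk in partition if blk & attrs)
               for attrs, freq in QUERIES)


# Start from the partition with one attribute per block.
start = frozenset(frozenset(x) for x in "abcd")
best, best_cost = stepwise_minimize(start, toy_cost)
```

With this toy cost the loop stops at the partition {a, b}, {c, d}, since merging those two blocks would raise the cost again. The result is a local minimum of whatever cost function is supplied, which is why the text above describes the outcome as not necessarily optimal.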

The heuristic approach to attribute partitioning does not suffer from the two major problems associated with the integer programming approach. The model of the database environment may be as complex as desired. The complexity of the model does not seriously hamper the heuristic's ability to find reasonable solutions to the partitioning problem (although it may affect the precise amount of search time required by the heuristic to find a reasonable solution). We note that, although our model does not consider certain parameters that have been considered by some earlier studies (e.g. the storage cost, overhead cost for accessing subfiles, different access and transfer costs for each subfile, and the imposition of constraints on the allocation of attributes to subfiles), we could readily incorporate these parameters into our model of the database management system and take them into consideration without needing to significantly alter our partitioning heuristics. (These extensions are discussed in Chapter 3.) The heuristic approach is also relatively more efficient with respect to the time needed to determine a solution. For example, the main attribute partitioning heuristic that we develop in Chapter 5 operates in time that is on the order of the product of the number of queries in the usage pattern and the square of the number of attributes in the file. This compares very favorably with the execution time of the


integer programming approach.

The model of the database management system that we have adopted in this work is in many ways a generalization of earlier work, and although not a model of any particular existing system, is more reflective of practical systems than earlier models. We have allowed more complicated forms of queries, and have also considered the effect of blocking subtuples into pages. We allow a diverse set of access structures in our model, including links, indices, and segments. The objective cost function that we seek to minimize is the total cost of answering the queries posed to the partitioned database, and is expressed in terms of the number of page accesses, rather than in terms of the amount of data transferred. Unlike the models of previous studies, two tuples which happen to reside on the same page, if retrieved, will not incur twice the access cost of retrieving one of them. Conversely, a single tuple that has its attributes partitioned, if retrieved, may necessitate several page accesses. Consequently, if the attributes of a file are partitioned such that attributes that are requested together are placed in separate subfiles, the performance cost of the partitioned database increases. This contrasts with previous models where access cost was determined solely in terms of total information transferred (and for which the trivial partition is always optimal). In our model we do not need to specify M, the number of subfiles in the chosen partition. Rather, M is unconstrained and is determined by the heuristics according to the optimal partition.


Our attribute partitioning system consists of four components. Figure 1 shows the block diagram of the system, in which each component appears as a box. The four components are: 1- the parameter acquisitor and forecaster (described in Chapter 3), 2- the file cost estimator (described in Chapter 4), 3- the query processor (described in Chapter 3), and 4- the attribute partitioning heuristics (described in Chapters 5 and 6). The circles in the figure represent the collection of forecasted parameters, prepared by the parameter acquisitor and forecaster, characterizing the database and its usage. Edges in the figure are labelled by the kind of information passed from one component to another. A brief description of each component follows.


1- The parameter acquisitor and forecaster monitors the database usage pattern and the response of the database management system to the queries in the usage pattern. At certain points in time, when file repartitioning is considered, it calculates trends and makes forecasts of the database usage pattern and database parameters for a time interval into the future.
2- The file cost estimator receives a proposed partition from the partitioning heuristics and evaluates it by finding the cost of processing each query in the forecasted usage pattern against the accordingly partitioned file. To compute the cost of processing a query, the file cost estimator passes the query to the query processor for query analysis. The query processor finds a method for the query and returns the method to the file cost estimator. A method for a query is a procedure indicating how to go about accessing the subfiles in order to answer that query. Using the forecasted database parameters for the future time interval, the file cost estimator computes the number of page accesses required to answer the query against the partitioned file according to the query's method. Summing these costs for all the queries in the usage pattern, the file cost estimator obtains an estimate for the performance cost of the proposed partition that the file would be expected to incur in the future time interval.

Figure 1. Block diagram of the attribute partitioning system.


3- The query processor evaluates a query in a partitioned environment by finding a method for the query. It uses the forecasted parameters of the database and the file partition. The query processor is heuristic and the method found is normally near optimal.


4- The attribute partitioning heuristics propose a suitable partition of a file's attributes. The proposed partition is passed for cost estimation to the file cost estimator. After the cost of the proposed partition is estimated, the heuristics attempt to come up with a partition that is incrementally superior to the last proposed partition. This process is continued until a partition is found such that no other partition proposed has a better performance compared to it. If the performance cost of the final partition is less than the cost of the current file partition by a margin that exceeds the file repartitioning cost, the file is repartitioned according to the resulting partition.
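
The interaction between the file cost estimator, the query processor, and the repartitioning decision can be sketched as follows. This is only an illustrative outline: the function names, the representation of a usage pattern as a list of attribute sets, and the two toy stand-ins for the query processor and the page access count are assumptions, not the components described in later chapters.

```python
def estimate_partition_cost(partition, usage_pattern, find_method, page_accesses):
    """Sum, over all queries in the usage pattern, the page accesses of each query's method."""
    return sum(page_accesses(find_method(query, partition), partition)
               for query in usage_pattern)


def should_repartition(current_cost, final_cost, repartitioning_cost):
    """Repartition only if the saving exceeds the cost of repartitioning."""
    return current_cost - final_cost > repartitioning_cost


# Toy stand-in for the query processor: the "method" is the list of subfiles touched.
def touched_blocks(query, partition):
    return [block for block in partition if block & query]


# Toy stand-in for the page access count: one access per subfile touched.
def one_access_per_block(method, partition):
    return len(method)


usage = [{"a", "b"}, {"a", "b"}, {"c"}]          # hypothetical forecasted queries
current = [{"a"}, {"b"}, {"c"}]                   # current file partition
proposed = [{"a", "b"}, {"c"}]                    # partition returned by the heuristics

current_cost = estimate_partition_cost(current, usage, touched_blocks, one_access_per_block)
proposed_cost = estimate_partition_cost(proposed, usage, touched_blocks, one_access_per_block)
decision = should_repartition(current_cost, proposed_cost, repartitioning_cost=1)
```

Passing the query processor and the page access model in as parameters mirrors the separation of components in the block diagram: the estimator itself only sums per-query costs.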
CHAPTER 3


THE MODEL OF THE DATABASE MANAGEMENT SYSTEM


In this chapter we will describe the underlying model of the database management system that we have assumed in our work. We will describe the storage structure and the access structures we have adopted for the physical representation of a relation, and the assumptions we have made for the purpose of reducing the problem of attribute partitioning to a manageable size. We will then describe the structure of the queries made to the database and the strategy employed to process the queries in a partitioned environment. Finally, we will list the parameters required by the components of our attribute partitioning system (the attribute partitioning heuristics, the file cost estimator, and the query evaluator), and describe how these parameters are obtained from monitoring the operations of the underlying database management system.


1. The File Model 


We have chosen the relational data model as the logical view of data for a database. A database in the relational context consists of one or more relations. However, in order to make the problem of attribute partitioning manageable in size, we address the reduced problem environment of a database with a single relation. In addition we assume that the physical implementation of a relation is a flat file. That is, a relation is stored as a set of unordered contiguous tuples in secondary memory. There are no hierarchies of domains, nor pointers from one tuple to another. Although the assumption of a flat file storage structure may seem rather severe, we note that this is the most natural way of storing a relation. Also, some of the drawbacks of the flat file storage structure, such as the placement of frequently used data together with seldom used data in the same physical locality, are precisely what attribute partitioning intends to eliminate. We note here that although the work reported here is based on the assumption of a single relation database and a flat file storage structure, the approach to attribute partitioning that we have taken and the attribute partitioning heuristics that we have developed should be extendible to problems where either of the two assumptions is relaxed. Specifically, if there is a facility available to estimate the cost of answering a query made to a multi-relational database with a non-flat file storage structure, then the main heuristic techniques that we have developed can be regarded as a viable candidate for the purpose of attribute partitioning. For further discussion of the possible ...

All the subfiles of the attribute partition are assumed to reside on direct access secondary storage devices like disks [6]. Storage space on such devices is divided into fixed size blocks called pages. A page is the information quantum transferred between the disk and primary memory in one disk access. The accessing cost of a page is assumed to be proportional to the average disk seek and latency times plus the page transfer time. Hence, accessing cost will be independent of the sequence of page accesses. Consequently, we may think of the pages of a file as scattered throughout the disk, with no restriction on their relative physical locations.

As mentioned above, the tuples of a relation are stored unordered with respect to any attribute. The order in which the tuples are stored will be their chronological order of insertion into the file. This makes the problem of file maintenance due to updates, insertions, and deletions much simpler. If a tuple is updated, the new values replace the old values in the same location. A tuple that is deleted is joined to a pool of deleted tuples and will be reused for newly inserted tuples. (Such a pool can easily be maintained by threading the tuples that were deleted into a list.) A new tuple that is inserted in the file replaces a tuple that has been previously deleted. If the pool of deleted tuples is empty, then the inserted tuple is appended to the end of the file (if the file occupies an integral number of pages, a new page is allocated to the file).
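
The maintenance strategy above can be sketched with a small in-memory model. The class name `FlatFile`, its methods, and the use of a Python list as the pool of deleted slots are assumptions made for illustration; an actual implementation would thread the deleted tuples through the file itself.

```python
class FlatFile:
    """Toy flat file: slots hold tuples in insertion order, deleted slots are reused."""

    def __init__(self, tuples_per_page):
        self.tuples_per_page = tuples_per_page
        self.slots = []   # slot i holds a tuple, or None once deleted
        self.free = []    # pool of deleted slot numbers (stands in for the threaded list)

    def insert(self, tup):
        """Reuse a deleted slot if the pool is non-empty, else append at the end."""
        if self.free:
            slot = self.free.pop()
            self.slots[slot] = tup
        else:
            slot = len(self.slots)
            self.slots.append(tup)
        return slot

    def delete(self, slot):
        """Mark the slot as deleted and add it to the pool."""
        self.slots[slot] = None
        self.free.append(slot)

    def update(self, slot, tup):
        """New values replace the old values in the same slot."""
        self.slots[slot] = tup

    def pages(self):
        """Number of pages occupied (ceiling of slots / tuples per page)."""
        return -(-len(self.slots) // self.tuples_per_page)


f = FlatFile(tuples_per_page=2)
s0, s1 = f.insert("t0"), f.insert("t1")
s2 = f.insert("t2")            # file was full at one page, so a second page is allocated
f.delete(s1)
reused = f.insert("t3")        # takes the slot freed by the deletion
```

In this sketch the file never shrinks, which matches the observation that follows: a burst of deletions can leave pages sparsely filled until the tuple space is recompacted.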
The above strategy for overflow handling is intended to maximize the number of undeleted tuples per page, and keep the file size to a minimum. The costs of a sequential search and of tuple retrieval by the link access path (described below) are inversely related to the (average) blocking factor (the average number of used tuples per page), and these costs should be minimized by keeping storage utilization of the tuple space as high as possible. Even with the above strategy, poor storage utilization may still ensue if the database usage pattern consists of a large number of insertions followed by an equally large number of deletions. To correct this, garbage collection may be performed on the tuple space so the tuples are recompacted to occupy as little space as possible. We note here that in partitioning the attributes of a file, the cost of garbage collection may be eliminated from consideration, and that if we ignore the effect garbage collection has on the subfile blocking factor, the optimal attribute partition is independent of the garbage collection cost. The reason for this is that no matter how the file is partitioned, garbage collection of deleted tuples requires that the entire file be brought into primary memory and shipped back out onto secondary storage. Since the total amount of storage is fixed regardless of how the file is partitioned (except for page breakage at the end of each subfile, which is negligible), the cost of garbage collection does not enter the optimization process. On the other hand, the blocking factors of the subfiles do influence the optimal partition. The more frequently the file is garbage collected, the fewer the number of unused tuples per page and the larger the blocking factor would be on the average for the file. Therefore, the optimal partition for the file will partially depend on the frequency of garbage collection. Since the optimal selection of points where the file is to be garbage collected is in itself another database design optimization problem, we will not consider the problem of the optimal determination of garbage collection points. (See the works of Shneiderman and of Yao et al. [38] for a discussion of this problem.) We will assume that the subfile blocking factor that the parameter acquisitor prepares for the attribute partitioning heuristics is the overall average of the observed blocking factors throughout the planning horizon.


We will assume that tuples are of fixed length (i.e. each tuple occupies the same amount of storage space), so that each page has a capacity for a fixed number of tuples. This implies that attribute values have fixed sizes, since a normalized relation has a fixed number of attributes per tuple. This assumption is in correspondence with the relation being a flat file and is a necessity for the realization of links between subtuples of the same tuple. We also make the assumption that each page contains an integral number of tuples, and that tuples do not overlap page boundaries.


In our file model, we allow three kinds of access structures. These are: segments, links, and indices. An access structure is a mechanism that makes the search and retrieval of tuples possible; in other words, given the value of an attribute, an access structure can locate and retrieve all tuples having that value for the attribute. The access path of an access structure is the way in which the structure is used in such a search. A segment is a file or a subfile that may be retrieved into primary memory and sequentially searched from top to bottom for tuples with a certain attribute value. Hence by using the sequential search access path of the segment access structure, we can both locate desired tuples and retrieve these desired tuples at the same time. A link is an access structure for retrieving tuples which have already been located. In other words, assume that we have a pointer or some other identifier that uniquely identifies a tuple by its location. Linking is the access path for deriving the physical address of the tuple from the identifier and retrieving the tuple from secondary storage. Therefore the link access structure is a mechanism for retrieving tuples that have already been identified and whose location is known. A link cannot be used to search for tuples that possess a certain value (content retrieving). In our file model, each subtuple of a subfile has a link to all its corresponding subtuples in the other subfiles. The corresponding subtuples (or co-subtuples) of a subtuple are all the subtuples that made up a single tuple before the file (and the tuple) was partitioned. An index is an access structure for locating subtuples with attributes that match certain values, without actually retrieving the subtuples. An index does not have the capability of retrieving tuples. In order to retrieve the tuples that have been located by an index, a link access structure is used. In our file model, any attribute of the relation may have an index; which ones are actually to be


indexed is a separate database design issue.

Sequentially searching a file (or a subfile) is a straightforward matter and we will not discuss it any further. A link, on the other hand, is an access structure that is widely used in our model and we will describe how linking is performed in detail.


Once a subtuple has been located or retrieved, it must be possible to retrieve any of its co-subtuples in the other subfiles. (A co-subtuple may have to be retrieved in order to see if an attribute of it contains a certain value, or in order to project one of its attributes.) Hence, we assume that each subtuple has links to all its co-subtuples and that the co-subtuples may be retrieved by linking. Links as a means of relating tuples may be classified according to purpose, realization, and cardinality. The purpose of a link in our model is to relate co-subtuples. Its realization is logical, i.e. the link is derived by transforming the address of one co-subtuple into the address of another one. The cardinality of the link is one-to-one, i.e. each subtuple is linked to exactly one subtuple in another subfile. Thus, there is no explicit pointer from a subtuple of a subfile to each of its co-subtuples. Rather, the addresses of the co-subtuples in different subfiles can be calculated from one another's addresses. By subtuple or co-subtuple address we mean the tuple identifier (or the logical address of the tuple), which is the address of the tuple relative to the base address of its file. When retrieving a subtuple by using a link, the subtuple's tuple identifier (TID) is translated by the file's page map table into the physical address of the page the subtuple resides on and the offset of the subtuple within the page. The page address is then used to retrieve the subtuple from secondary storage in one page access. Note that when retrieving any number of subtuples that reside on the same page, the page needs to be retrieved only once.

Once we have the TID of a subtuple in a subfile (which happens when the subtuple has been located or retrieved), we may obtain the TIDs of all co-subtuples in other subfiles by applying a transformation to the subtuple's TID. This has been made possible because when partitioning a flat file implementation of a relation, co-subtuples retain their relative position in their subfiles, and also because within a subfile all subtuples are of the same size. To see what this transformation is, let $\tau_i$ be the tuple with tuple number $i$ (i.e. $\tau_i$ is the $i$th tuple of the file). If the file is partitioned into $M$ subfiles, then $\tau_{i1}, \tau_{i2}, \ldots, \tau_{iM}$ are the $M$ co-subtuples. Let $t_{ij}$ be the TID of subtuple $\tau_{ij}$, $1 \le j \le M$. We want to calculate $t_{in}$, the TID of $\tau_{in}$, from $t_{ij}$, the TID of $\tau_{ij}$. We first show how to get the tuple number $i$ from the TID $t_{ij}$. Let $S$ be the system page size, $l_j$ the subtuple length in subfile $F_j$, and let $b_j = \lfloor S/l_j \rfloor$ be the number of subtuples per page in $F_j$ (we have assumed that tuples or subtuples do not cross page boundaries); then $\lfloor t_{ij}/S \rfloor$ is the page number of $\tau_{ij}$ and $(t_{ij} \bmod S)$ is the offset of $\tau_{ij}$ in its page. The tuple number $i$ is therefore:

(3.2.1)    $i = b_j \lfloor t_{ij}/S \rfloor + (t_{ij} \bmod S)/l_j$

Finally, we want to calculate $t_{in}$, the TID of $\tau_{in}$, from $i$, the tuple number. Since $\lfloor i/b_n \rfloor$ is the page number for the subtuple $\tau_{in}$ in $F_n$ and $(i \bmod b_n)$ is the number of the subtuple within the page, we have:

(3.2.2)    $t_{in} = S \lfloor i/b_n \rfloor + l_n (i \bmod b_n)$
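
Equations (3.2.1) and (3.2.2) translate directly into integer arithmetic. The sketch below is only an illustration of the two formulas: the function names and the sample page size and subtuple lengths are assumptions, and TIDs are taken to be byte addresses relative to the start of the subfile with tuple numbers starting at zero, as the formulas imply.

```python
def tuple_number(tid, page_size, subtuple_len):
    """Equation (3.2.1): recover the tuple number i from a TID in one subfile."""
    per_page = page_size // subtuple_len              # b_j = floor(S / l_j)
    return per_page * (tid // page_size) + (tid % page_size) // subtuple_len


def tid_of(i, page_size, subtuple_len):
    """Equation (3.2.2): compute the TID of tuple number i in a subfile."""
    per_page = page_size // subtuple_len              # b_n = floor(S / l_n)
    return page_size * (i // per_page) + subtuple_len * (i % per_page)


def co_subtuple_tid(tid, page_size, len_from, len_to):
    """Follow a logical link: map a TID in one subfile to the co-subtuple's TID in another."""
    return tid_of(tuple_number(tid, page_size, len_from), page_size, len_to)


# Hypothetical sizes: 4096-byte pages, 100-byte subtuples in one subfile, 30-byte in another.
S, L_FROM, L_TO = 4096, 100, 30
t_from = tid_of(85, S, L_FROM)                        # tuple 85: page 2, slot 5
t_to = co_subtuple_tid(t_from, S, L_FROM, L_TO)       # same tuple in the other subfile
```

Because the link is computed rather than stored, no pointer space is needed in the subtuples; the price is the fixed-length and no-page-overlap assumptions made earlier.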
Another access structure we have considered in our model is the index. A file may have an index on one or more of its attributes. A subfile which contains an indexed attribute has indexing as an access path to locate subtuples having a specific value for that attribute. In our model, we have chosen an index to be a balanced tree, where each node of the tree is a page. A detailed description of the index's structure may be found in the work of Chan and in Blasgen and Eswaran [6]. The index is very similar to the B-tree of Bayer and McCreight [3], both in terms of structure and the way it is maintained. Briefly, each non-leaf page of the index contains an ordered set of pairs of keys (attribute values) and pointers, each pointer pointing to a node in the next lower level of the index. The key in the pair is the highest key of the node the pair points to. A leaf page contains each key followed by an ordered list of TIDs of subtuples that have the key as the value of the indexed attribute in the subfile. The choice of index structure for our work is obviously not limited to balanced trees. Any index that lends itself to cost analysis and which is independent of the choice of file partition may be alternatively used.
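
The index layout just described can be sketched with two small node classes. This is a simplified illustration under assumptions introduced here (class names `Leaf` and `Inner`, in-memory nodes standing in for pages, no insertion or rebalancing), not the structure of the cited works. It shows how a lookup descends by comparing the search key with the highest key of each child and returns the ordered TID list, along with the number of index pages visited.

```python
from bisect import bisect_left


class Leaf:
    """Leaf page: sorted keys, each with an ordered list of TIDs."""

    def __init__(self, entries):
        # entries: list of (key, tids) pairs, sorted by key
        self.keys = [key for key, _ in entries]
        self.tids = [sorted(tids) for _, tids in entries]


class Inner:
    """Non-leaf page: for each child, the highest key stored beneath it."""

    def __init__(self, children):
        self.children = children
        self.high_keys = [highest_key(child) for child in children]


def highest_key(node):
    return node.keys[-1] if isinstance(node, Leaf) else node.high_keys[-1]


def lookup(node, key):
    """Return (TIDs for key, number of index pages visited)."""
    pages = 1
    while isinstance(node, Inner):
        pos = bisect_left(node.high_keys, key)   # first child whose highest key >= key
        if pos == len(node.children):
            return [], pages                      # key exceeds every key in the index
        node = node.children[pos]
        pages += 1
    pos = bisect_left(node.keys, key)
    if pos < len(node.keys) and node.keys[pos] == key:
        return node.tids[pos], pages
    return [], pages


root = Inner([Leaf([(10, [3, 1]), (20, [7])]),
              Leaf([(30, [2]), (40, [9, 4])])])
found = lookup(root, 40)
missing = lookup(root, 25)
```

The TIDs returned by such a lookup only locate the subtuples; retrieving them still goes through the link access path, which is why the page count of the index and the page count of the data are accounted for separately.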


To concentrate on the problem of attribute partitioning, we assume that the choice of indices and their structure is predetermined and known beforehand on the basis of other criteria besides the file partition. This is not to suggest that the problems of index selection and attribute partitioning are independent of one another in principle. Indeed, the two problems are mutually interdependent, and an optimal solution to both the index selection problem and the attribute partitioning problem can be achieved only by their simultaneous solution. The problem of selecting indices that befit a database usage pattern has been extensively analyzed by Chan. Our work on attribute partitioning takes up another dimension of the general problem of physical database design.


4. The Transaction Model


We will consider four types of transactions that may be conducted against the database: queries, updates, insertions and deletions. The query and the update transactions consist of two components: a selection component that determines the tuples that are to be selected, and a projection component that determines which attributes of the selected tuples are to be extracted and returned (in the case of a query) or updated (in the case of an update). The deletion transaction consists only of a selection component that determines which tuples have to be deleted. The insertion transaction has no components. An insertion transaction is basically a set of tuples that have to be inserted in the file. Because of the similarities among query, update, and deletion transactions, henceforth we will discuss only one of them, namely queries, in full detail. The reader should note that the discussion for queries can be generalized for the other transaction types as well. The main difference among the transaction types is how the projection component of each transaction type is processed after the tuples are selected. This difference in processing the projection component will be delineated later.

We have made certain simplifying assumptions on the structure of the queries considered in our model. The simplifications were necessitated by the need to reduce the task of query cost analysis to a manageable size. We have disallowed join operations on the relation in queries. The boolean expression in the selection component of a query consists of either a conjunction made of equality conditions, or a disjunction made of equality conditions. A query with just one equality condition is considered to be a special case of a conjunctive query. An equality condition is a predicate of the form (a = x), where a is an attribute name, and the attribute value x of the equality condition is a constant or program variable which is known at the time the query is processed. The equality condition in the selection component is used to search for all the tuples (subtuples) in the file (subfile) that have attribute value x for attribute a. The projection component is a set of attributes whose values are extracted from all tuples that satisfy the selection component and returned as the answer to the query. In a conjunctive query, an attribute cannot appear twice in the selection component, or appear both in the selection and projection component. Although we have restricted the set of allowable queries by the assumptions presented above, we have still included a large number of possible queries, including many of the more frequent queries encountered in practical database applications.
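The restrictions on query structure can be made concrete with a small sketch. The class name `Query`, the function `is_allowed`, and the attribute names in the usage note are hypothetical and are introduced here only for illustration; the sketch checks just the two rules stated above for conjunctive queries.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical representation of a query in the restricted model."""
    kind: str                                  # "conjunctive" or "disjunctive"
    selection: Tuple[Tuple[str, object], ...]  # equality conditions (a = x)
    projection: FrozenSet[str]                 # attributes to be returned


def is_allowed(q: Query) -> bool:
    """Check the structural rules stated in the text for conjunctive queries."""
    if q.kind not in ("conjunctive", "disjunctive"):
        return False
    if q.kind == "conjunctive":
        attrs = [a for a, _ in q.selection]
        # An attribute may not appear twice in the selection component.
        if len(attrs) != len(set(attrs)):
            return False
        # An attribute may not appear in both selection and projection.
        if set(attrs) & q.projection:
            return False
    return True
```

For instance, a conjunctive query selecting on `dept` and `city` and projecting `name` passes the check, while one that selects on `dept` twice does not. A query with a single equality condition is simply a conjunctive query with one pair in `selection`.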

When a query is made to a database, the query processor does the necessary search and retrievals on the database and returns the answer to the query. There is a cost associated with processing a query. In our attribute partitioning system, we have incorporated a query evaluator and a file cost estimator that can analyze a given query and provide an estimate of the cost of answering the query. Query cost analysis is a complex task. The assumptions we have made on the structure of the query alleviate some of the difficulties in query processing and query cost analysis. Besides the assumptions on the structure of a query, query cost analysis depends on the assumptions made on the distribution of attribute values in the file. Query cost analysis also depends on the distribution of attribute occurrences in the selection and projection components of queries and the distribution of attribute values in the equality condition predicates of queries. As we have mentioned in Chapter 2, previous work done on attribute partitioning made simplifying assumptions on the distribution of attribute values and on the distribution of attribute requests in order to keep the problem of query cost analysis (and hence the attribute partitioning problem) within manageable limits. We have also made simplifying assumptions on the distribution of attribute values and attribute requests in building our model of the database management system. However, our simplifying assumptions are less restrictive in nature than those made in the works of our predecessors and are closer to the realities of practical database usage. We have made the following two assumptions in our transaction model.


1- We assume the fraction of tuples that satisfy a one-predicate selection is the selectivity of the attribute in the equality condition. The (average) selectivity of an attribute of a relation is the average fraction of tuples under consideration that have historically satisfied an equality condition involving that attribute. In other words, the selectivity of an attribute is the fraction of tuples that will most probably satisfy an equality condition on the file. The concept of an attribute selectivity measure is an important tool for database modelling and query cost analysis. The attribute selectivity measure will be defined and described fully in the section on Parameter Acquisition. From the attribute selectivities, the number of tuples that satisfy an equality condition on an attribute is estimated as the product of the selectivity and the number of tuples in the file. Using this measure of selectivity avoids the naive assumption that the attribute values are uniformly distributed in number, and that the number of tuples satisfying an equality condition is the total number of tuples divided by the number of different values of the attribute. Also, by using this measure, we have avoided the simplistic assumption that attribute values of a given attribute occur with equal probability in the selection components of queries. Although we could have obtained a still better model of value distribution by noting the number of tuples that contain each value of an attribute in a table, the attribute selectivity measure has the definite advantage that it takes little storage for its preservation. The other scheme requires that a table of attribute value frequencies be maintained for each attribute in the file, and if there are many distinct values for an attribute, this table will consume a significant amount of storage and will also be very difficult to update.


2- Since we allow the specification of queries with multiple equality condition predicates, it is necessary to have a measure for the joint resolving power of two or more equality conditions. (This measure is called the joint selectivity measure.) For this purpose, we will assume that the appearance in tuples of values belonging to different attributes is independent. (I.e., the probability that value x of attribute a and value y of attribute b appear in the same tuple is equal to the product of their individual probabilities of appearance.) Hence the fraction of tuples satisfying a conjunction of predicates simultaneously is the product of the fractions that satisfy each predicate, and the fraction of tuples satisfying a disjunction of predicates is the complement of the fraction not satisfying any of the predicates of the disjunction.
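The two assumptions translate directly into arithmetic. The following is a minimal sketch under the independence assumption stated above; the function names are hypothetical, and a selectivity is represented as the fraction of tuples satisfying an equality condition on the attribute.

```python
from math import prod


def conjunctive_fraction(selectivities):
    """Fraction of tuples satisfying every predicate: product of the fractions."""
    return prod(selectivities)


def disjunctive_fraction(selectivities):
    """Fraction satisfying at least one predicate: complement of the
    fraction that satisfies none of them (product of the complements)."""
    return 1.0 - prod(1.0 - s for s in selectivities)


def expected_tuples(fraction, n_tuples):
    """Estimated number of satisfying tuples in a file of n_tuples."""
    return fraction * n_tuples
```

With two attributes of selectivity 0.1 and 0.2, the conjunction is expected to select a fraction 0.1 × 0.2 = 0.02 of the tuples, and the disjunction a fraction 1 − 0.9 × 0.8 = 0.28.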


One assumption we do not make in our model, however, is that attributes occur independently of one another in the selection and projection components of the query. Neither do we make the less naive but nevertheless still restrictive assumption that the correlation between attribute occurrence in queries is determined by joint probabilities of attribute occurrence. We actually keep a record (in a table, called the table of query types) of all queries made to the database, and the exact correlation in the occurrence of attributes in queries may be obtained from this table. Thus we avoid making the strong (and often inaccurate) assumption that an attribute is requested by a query independent of what other attributes are requested by that query. This table of query types is concise in that queries involving the same attributes but different attribute values are clustered together in one entry. (The number of queries in the cluster is also recorded in the entry.)
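One possible shape for the table of query types is sketched below. The key layout, the function names, and the choice to keep the kind of query (conjunctive or disjunctive) as part of the key are assumptions made for illustration; the text only states that queries over the same attributes are clustered and counted.

```python
from collections import Counter


def query_type(kind, selection, projection):
    """Key that ignores attribute values and keeps only which attributes appear."""
    return (kind, frozenset(attr for attr, _ in selection), frozenset(projection))


def record(table, kind, selection, projection):
    """Add one query to its cluster, incrementing the cluster's count."""
    table[query_type(kind, selection, projection)] += 1


def cooccurrence(table, attr_a, attr_b):
    """Number of recorded queries that mention both attributes anywhere."""
    return sum(count for (_, sel, proj), count in table.items()
               if {attr_a, attr_b} <= (sel | proj))


table = Counter()
record(table, "conjunctive", [("dept", "toy"), ("city", "Boston")], ["name"])
record(table, "conjunctive", [("dept", "shoe"), ("city", "Paris")], ["name"])
record(table, "disjunctive", [("dept", "toy")], ["salary"])
```

The first two queries differ only in their attribute values, so they fall into one entry with a count of two. Exact co-occurrence counts can then be read off the table rather than derived from an independence assumption.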


5. Query Processing


An integral part of a database management system is a facility to decide how to answer queries. Since we are modelling a database management system that decides how to answer queries posed to the database, and since in the course of attribute partitioning we need to estimate the cost incurred in answering a query posed to the model database, our attribute partitioning system will also need to decide how to answer queries. When a query is made to the database, appropriate access paths must be chosen so that tuples satisfying the selection component of the query may be located. After the satisfying tuples are located (i.e., a TID list of such tuples is obtained), the same access path (or possibly some other access path) will have to be used in order to retrieve the tuples.

For example, assume we have a conjunctive query involving attributes a_1, ..., a_k in the selection component and attributes a_(k+1), ..., a_n in the projection component made to a partitioned file. In order to answer the query, the subtuples that satisfy the equality conditions on a_1, ..., a_k need to be located (by creating a TID list pointing to the subtuples), and then their co-subtuples containing attributes a_(k+1), ..., a_n have to be retrieved so that the values of the projection attributes a_(k+1), ..., a_n may be extracted and returned. Assume that there are indices available on some of the attributes a_1, ..., a_k. We may proceed to locate the subtuples that satisfy the selection component in either of the two following ways (in the rest of this section, we do not explicitly specify when the transformations are to be performed on TIDs of subtuples to get the TIDs of co-subtuples; we assume the transformations are performed whenever necessary):

1- Use all the applicable indices to retrieve the TID lists of subtuples satisfying the indexed attributes; intersect these TID lists (assuming the query is conjunctive), and from the resulting TID list link to the subfiles that contain any of the unindexed selection attributes (an applicable index is an index on a selection attribute). Subfiles are accessed one at a time. Every time a subfile is accessed, the subtuples with TIDs in the list are retrieved (via links) and checked to see if they satisfy the equality conditions on the unindexed attributes. The TIDs of subtuples that fail the equality condition on any of the unindexed attributes are then pruned from the TID list (i.e., the TID list of the subtuples that satisfy all of the unindexed selection attributes in the subfile is intersected with the old TID list). After all the subfiles containing selection attributes have been accessed, and the TIDs in the list have been tested to satisfy the equality conditions, all subtuples with TIDs in the list are retrieved (again by linking) from all the subfiles containing projection attributes, and the projection attributes are extracted.

2- Use none of the indices. Sequentially search one of the subfiles containing selection attributes, and create a TID list of the subtuples that satisfy all predicates involving the subfile. Thereafter, using the TID list, link one by one to the subfiles containing the remaining selection attributes, until a TID list of subtuples satisfying the entire selection component of the query is obtained. Finally, link to subfiles containing projection attributes.
Each of the above two schemes may be thought of as a step by step procedure where at each step an access path (sequential searching, indexing, linking) is performed in order to obtain the TID list of subtuples that satisfy one or more of the equality conditions in the selection component. Let us call the act of obtaining TIDs of subtuples that satisfy the equality condition on an attribute the act of resolving the attribute. Hence each of the above two schemes is a step by step procedure where at each step an index is used to resolve an attribute, or a sequential search/link is used to resolve one or more attributes in one subfile. We define the method of a query to be such a step by step procedure where at each step an access path is used in order to resolve one or more attributes.
A query usually has many different methods. For example, in the two schemes above, we chose either to use all the indices, or to use none. We might have chosen to resolve some of the indexed attributes (in the selection component of the query) by indexing, while resolving the rest of the indexed and unindexed selection attributes by linking. Similarly, when linking to subfiles, the subfiles will be accessed in some sequential order (i.e., one subfile is linked first, another subfile second, etc.). Each distinct subset of applicable indices and each distinct subfile sequence constitute a method of the query. (Hence each of the two schemes above may be translated into many methods as the sequence of linking to subfiles is instantiated.)
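To illustrate how the number of methods arises, the following sketch enumerates every combination of an index subset and a subfile sequence. The function name and the representation of a partition as a list of attribute sets are assumptions for illustration only; a method is reduced here to the pair (indices used, order in which subfiles are accessed).

```python
from itertools import chain, combinations, permutations


def enumerate_methods(partition, selection_attrs, indexed_attrs):
    """Yield (indices_used, subfile_sequence) pairs for a conjunctive query.

    partition: list of attribute sets, one per subfile (hypothetical layout).
    A subfile appears in the sequence if it holds a selection attribute
    that is not resolved through one of the chosen indices.
    """
    applicable = sorted(set(selection_attrs) & set(indexed_attrs))
    index_subsets = chain.from_iterable(
        combinations(applicable, r) for r in range(len(applicable) + 1))
    for used in index_subsets:
        unresolved = set(selection_attrs) - set(used)
        to_access = [i for i, subfile in enumerate(partition)
                     if unresolved & set(subfile)]
        for sequence in permutations(to_access):
            yield frozenset(used), sequence


methods = list(enumerate_methods(
    partition=[{"a", "b"}, {"c"}, {"d"}],
    selection_attrs={"a", "b", "c"},
    indexed_attrs={"a", "d"},
))
```

In this small example only the index on `a` is applicable, and whether or not it is used, subfiles 0 and 1 must both be accessed, so there are four methods. When no index is used, the first subfile of the sequence is the one that would be sequentially searched.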

There is a cost associated with a query's method. Depending on what indices are used and in what sequential order the subfiles are linked, the cost of answering the query will be different. For example, assume that in resolving the attributes of a query, two subfiles have to be linked and when each subfile is linked, the size of the TID list will be reduced by an equal factor. Then it is better to link to the subfile with the smaller number of pages before we link to the subfile with the larger number of pages. Although in the first method the second link will access more pages than the second link in the second method, the first link in the second method will incur even more page accesses than the first link in the first method. Therefore, it is important that a query processor consider the methods of a query and select the method which results in the least number of page accesses when answering the query. The optimal method of a query will depend on the attributes in the selection component of the query, the attributes in the projection component, the attribute selectivities and lengths, the attribute partition, and on other database parameters. A query processor has to consider all these parameters when choosing a method for a query.
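The two-subfile example can be checked numerically. The sketch below does not use the page access function of this report (given in Chapter 4 and not reproduced here); as a stand-in it assumes Cardenas' well-known approximation for the expected number of pages touched when a given number of tuples is retrieved at random. All names and figures are hypothetical.

```python
def pages_accessed(n_tuples, tid_count, blocking_factor):
    """Stand-in estimate (Cardenas' approximation), not the report's function:
    expected pages touched when tid_count tuples are fetched at random from
    a subfile of n_tuples stored blocking_factor tuples per page."""
    n_pages = n_tuples / blocking_factor
    return n_pages * (1.0 - (1.0 - 1.0 / n_pages) ** tid_count)


def sequence_cost(n_tuples, initial_tids, subfiles):
    """Total pages for linking to subfiles in the given order.

    subfiles: list of (blocking_factor, reduction) pairs, where reduction is
    the factor by which the TID list shrinks after that subfile is linked.
    """
    total, tids = 0.0, initial_tids
    for blocking_factor, reduction in subfiles:
        total += pages_accessed(n_tuples, tids, blocking_factor)
        tids *= reduction
    return total


small = (100, 0.1)  # 10,000 tuples at 100 per page: 100 pages
large = (10, 0.1)   # 10,000 tuples at 10 per page: 1,000 pages
small_first = sequence_cost(10_000, 1_000, [small, large])
large_first = sequence_cost(10_000, 1_000, [large, small])
```

Under these assumed figures, linking the smaller subfile first costs roughly 195 pages in total against roughly 696 for the opposite order, even though the second link alone is more expensive in the first ordering.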
The purpose of this section is to present how our attribute partitioning system goes about choosing a method for the query made to the partitioned database. Before we present our strategy for choosing methods, we delineate and describe the different phases of query processing; the first phase is the phase in which the query processor decides on the optimal method of the query.


Processing a query made against a partitioned database with a single relation, and in which no joins or aggregate operators are involved, consists of three phases: 1- query evaluation, 2- query resolution, and 3- query answering.


1- Query evaluation - Query evaluation is the process of finding the optimal method for a query. In an environment where the file is partitioned and indices are available, this entails: 1- selecting the indices to use in answering the query, which could be selecting all, none, or some of the applicable indices, 2- selecting the sequence of accessing those subfiles that contain selection attributes not resolved by means of indexing. (Note that if no index is utilized by the method, then the first subfile of the method will be sequentially searched, while the following subfiles will be linked.) The agent for finding the optimal method for a query (or finding a suitable method in case the optimal method is difficult to find) is the query evaluator. The query evaluator chooses a method with the objective of incurring the fewest page accesses when answering the query. Later in this section, we will present the strategy used by the query evaluator of our attribute partitioning system. The method chosen by our query evaluator is not necessarily the optimal method for the query, although we will find that our strategy results in near-optimal methods. When a satisfactory method is found for a query, we say that the query is evaluated.

Note that in our model of query evaluation, the query evaluator does not take into consideration the projection attributes of a query. Strictly speaking, the query evaluator should also take the projection attributes into consideration, in that the method of the query should specify the sequence of linking to the subfiles that contain projection attributes. This is because the cost of answering a query is influenced by the sequence in which projections are made. For example, if a subfile contains both selection attributes and projection attributes, then it is beneficial that this subfile be linked last in the method, since if the subfile is linked last, both the selection attributes may be tested and the projection attributes may be projected concurrently. If this subfile is not linked last, it will be linked once for resolving the selection attributes and another time for projecting the projection attributes. (Note that each time the subfile will be linked from a different TID list.) We have eliminated projection attributes from consideration when evaluating a query. We do this because: 1- considering projection attributes will make the problem of query evaluation still more difficult, and 2- we believe that the sequence in which the subfiles are linked for resolving attributes has a more profound influence on the cost of answering a query than the sequence in which the subfiles are linked for projecting attributes.

The query evaluation performed by our query evaluator does not entail any input/output operations (i.e., page accesses). The query evaluator does not need to know about the actual data contents of the subfiles; it only requires the various parameters prepared by the parameter acquisitor. The query evaluator evaluates a query by choosing a method for it, utilizing some strategy. One such strategy is exhaustively enumerating all possible methods for the query, estimating the cost of answering the query according to each method (by using the file cost estimator), and then choosing the optimal method. We have discarded this strategy because it is computationally intractable to consider all possible methods for a query. This is especially true when making a query against a database partitioned into many subfiles and/or if there are many indices available. The strategy we use for query evaluation is instead based on heuristics for choosing a method without requiring extensive analysis of the query.


Query evaluation is the only phase of query processing that is an optimization process. The other two phases of query processing do not attempt to optimize the cost of processing a query. Query evaluation is also the only phase actually performed in our attribute partitioning system. The next two phases are only performed by a database management system when it actually processes a query. The reason our attribute partitioning system evaluates queries is that the method of a query is required in order to estimate the cost of answering the query. The query evaluator of our system supplies the method of the query to the file cost estimator, which computes the cost of locating the selected subtuples (according to the method) and the cost of retrieving the subtuples needed for projection. (The attribute partitioning heuristics require the cost of answering all the queries in the database usage pattern. See Section 4.4 for a detailed discussion on how the file cost estimator estimates the cost of answering a query.)


2- Query resolution - Query resolution is the process of locating the set of tuples that satisfy the selection component of the query. A query is resolved when all the selection attributes are resolved and a list containing the TIDs of all satisfying tuples is produced. After a query is evaluated, the query is resolved by accessing the indices specified in the query's method and performing the links to the subfiles in the order specified in the query's method. In each step of the method, the access path specified in the step is actually performed and a TID list is created of subtuples that satisfy the equality condition predicates of the attributes that are to be resolved in that step. For a conjunctive query, this TID list is intersected with the (old) TID list that is the result of the preceding steps of the resolution process. If the query is a disjunction, the union of the new TID list is taken with the old TID list. The final TID list obtained from the last step of the method is the result of the query resolution phase. In the process of query resolution, page accesses are made to secondary storage when performing an access path.
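The way TID lists are combined across the steps of a method can be sketched with ordinary set operations. The function name and the use of plain integers as TIDs are assumptions for illustration; the sketch shows only the combining rule, not the page accesses made by each access path.

```python
def resolve(kind, step_tid_lists):
    """Combine the TID lists produced by successive steps of a method.

    kind: "conjunctive" (intersect with the old list) or
          "disjunctive" (take the union with the old list).
    step_tid_lists: one iterable of TIDs per step, in method order.
    """
    if kind not in ("conjunctive", "disjunctive"):
        raise ValueError(f"unknown query kind: {kind!r}")
    result = None
    for tids in step_tid_lists:
        tids = set(tids)
        if result is None:
            result = tids      # the first step starts the TID list
        elif kind == "conjunctive":
            result &= tids     # keep TIDs satisfying all steps so far
        else:
            result |= tids     # keep TIDs satisfying any step so far
    return sorted(result or [])
```

For example, with step results `[1, 2, 3, 4]` and `[2, 4, 6]`, a conjunctive query resolves to `[2, 4]` and a disjunctive query to `[1, 2, 3, 4, 6]`.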


3- Query answering - A query is answered when all subtuples containing projection attributes that are pointed to by the TID list are retrieved into primary memory and the attribute values of attributes specified in the projection component of the query are extracted and returned. This phase of query processing involves only input/output operations and no optimization.

As previously mentioned, if the last subfile that is linked in the resolution phase contains projection attributes, then it is possible to start answering the query before the query is completely resolved, by extracting the projection attribute values from a subtuple when the selection attribute values of the subtuple satisfy the predicate. In other words, the query resolution and the query answering phases may overlap on the last subfile in the method.


In the rest of this section we will discuss the query evaluation strategy the query evaluator of our attribute partitioning system uses. Finding a satisfactory method for a query in a partitioned environment is an involved task. Unlike query evaluation in an unpartitioned environment, where the query evaluator only has to choose the optimal set of applicable indices, a query evaluator in a partitioned environment in addition has to choose the sequence of linking to the subfiles. Our query evaluator is a heuristic evaluator that finds a satisfactory method for the query without resorting to cost estimation. The method obtained by the query evaluator is not necessarily the optimal method for that query, although (in the course of our work) we have found it to be near-optimal. We will first discuss the query evaluation strategy for conjunctive queries. Thereafter we discuss in what way the strategy used for disjunctive queries is different.

Query evaluation consists of two stages. In the first stage, the query evaluator selects the subset of applicable indices to include in the method. After this has been determined, the query evaluator has to choose a sequential order for linking to the subfiles that contain the rest of the selection attributes.

1- Depending on the attributes in the selection component, their selectivities, and the attribute partition, it may be beneficial to use none, all, or a subset of the applicable indices. We believe that for most queries, using either none or all of the applicable indices will lead to satisfactory methods. Also, in order to reduce the problem of query evaluation to a manageable size, we will restrict our attention to the above two choices.

One criterion by which we may judge the effectiveness of using the indices to process a query is the joint selectivity of the indexed attributes that occur in the selection component of the query. Assume that: 1- the indexed attributes are not jointly selective (i.e., the joint resolving power of the indices is low and a large fraction of the tuples will be selected so that almost all the pages of a subfile that is linked thereafter have to be retrieved), and assume that 2- a subfile that contains an indexed attribute also contains some other unindexed selection attributes. Then such a subfile will most likely be accessed in its entirety in order to resolve the unindexed attribute. Therefore, the indexed attribute in the subfile can be resolved by the link at the same time the unindexed attribute is being resolved and with no extra cost. Hence when the indexed attributes are not jointly selective, using the indices will not save in the number of pages accessed.

Thus, when the joint selectivity of the indexed attributes is not too low (which is the case for the great majority of queries), the query evaluator will choose to use the full set of applicable indices. This is because the cost of resolving an attribute utilizing an index (if available) on the attribute is usually a fraction of the cost of resolving that attribute by linking to (or sequentially searching) the subfile containing it. This can be true even if the subfile containing the indexed attribute contains other unindexed selection attributes and has to be eventually linked. If the indexed attribute and the unindexed selection attributes residing in the same subfile as the indexed attribute are resolved simultaneously by linking from a TID list to their subfile, there may be more pages accessed than when the indexed attribute is resolved first using the index, the TID list pruned and reduced (as the result of the indexing), and then the subfile linked to resolve the remaining unindexed attributes.

Whether all the applicable indices are used or none of the applicable indices are used, the query evaluator will have to choose a sequence for linking to subfiles containing unresolved selection attributes. This is done in the second stage of query evaluation.


2- The second stage of query evaluation begins when the indices that are to be used have been chosen. The query evaluator will then have to link to the subfiles containing the unresolved selection attributes, starting from the TID list containing the result of the indexing. Every time a subfile containing an unresolved selection attribute is linked, the TID list is reduced to a TID list of subtuples that satisfy the selection attributes in the subfile in addition to the previously resolved attributes. The subfiles containing unresolved selection attributes are linked in sequence, producing successively more refined TID lists. When all the subfiles have been linked, the query is resolved and the TID list points to the selected subtuples. The task of the query evaluator in this stage of query evaluation is to find the optimal sequence of linking to subfiles. Note that the query evaluator does not actually perform the linking. The query evaluator only decides on the sequence of linking to the subfiles. It is the query resolver that actually performs the linking (in the sequence decided by the query evaluator) and retrieves the subtuples from the subfiles. The query evaluator may need to know the expected cost of linking to subfiles when deciding on the sequence. An estimate of the expected cost of linking to subfiles can be obtained without actually performing the linking. In Chapter 4 we describe the function used for this cost estimation. This function translates the number of tuple retrievals into the number of page retrievals and only requires the size of the TID list from which the linking is performed. The size of the TID list is readily available as the product of the joint selectivity of the attributes resolved so far and the total number of subtuples in the subfile (the joint selectivity of a set of attributes is obtained by expression 3.6.1 from the individual attribute selectivities). Also note that if no index is chosen in the first stage of query evaluation, the first subfile of the sequence is sequentially searched (which is tantamount to linking to the subfile from a TID list containing all the TIDs of the subtuples in the subfile).

The criterion for optimization in this stage of query evaluation is the minimization of the total number of page accesses when answering the query. Depending on the sequence chosen in this stage, the method of a query may be optimal or highly nonoptimal. Therefore it is important that the query evaluator use a query evaluation strategy which guarantees that the sequence chosen is close to optimal for most of the queries evaluated. As we mentioned before, exhaustive enumeration of all k! possible subfile sequences (where k is the number of subfiles containing unresolved selection attributes) is out of the question because cost estimating all of the sequences is too expensive. Due to the large search space (of possible sequences) and the numerous parameters that have to be considered in choosing a sequence, finding the optimal subfile sequence is a difficult task. However, we may qualitatively arrive at desirable sequences by considering the following criteria when deciding on the subfile sequence.

1- Subfiles that can have their selection attributes resolved without incurring too many page accesses should be linked prior to linking to subfiles that incur many page accesses. That is, at each step where a subfile is to be linked, the query resolver should link to the subfile that results in the least number of page accesses. Equivalently, this means linking to the subfile with the largest blocking factor (number of tuples per page), since the subfile with the largest blocking factor will result in the fewest pages accessed. (To see this, we refer the reader to the discussion of the page access function presented in Chapter 4 and expression 4.2.1b. In this expression, for fixed n and fixed r, the number of page accesses monotonically decreases as b increases.)

2- The subfiles that make the joint selectivity of the resolved attributes become highest (most selective) should be linked first. That is, the query evaluator should compute the joint selectivity of the unresolved selection attributes of each subfile and select the subfile with the highest joint selectivity (i.e., the subfile that reduces the TID list the most) to be linked next. In this manner, the overall joint selectivity will tend to become high as early as possible, causing the TID list of satisfying subtuples to be reduced earlier and fewer page accesses to be incurred as the query resolver goes to the next steps of the method.

The above two criteria can be conflicting requirements. A subfile may have a low blocking factor but high joint selectivity for unresolved selection attributes, while another may have a high blocking factor but low joint selectivity.
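The conflict between the two criteria can be seen in a small sketch. The dictionary layout, the function names, and the figures are hypothetical. Selectivities are represented as fractions of satisfying tuples, so in the terminology of the text the "highest" (most selective) joint selectivity corresponds to the smallest joint fraction.

```python
from math import prod


def order_by_blocking_factor(subfiles):
    """Criterion 1: link subfiles with the largest blocking factor first."""
    return sorted(subfiles,
                  key=lambda name: subfiles[name]["blocking_factor"],
                  reverse=True)


def order_by_joint_selectivity(subfiles):
    """Criterion 2: link the most selective subfile first, i.e. the one whose
    unresolved selection attributes jointly leave the smallest fraction."""
    return sorted(subfiles,
                  key=lambda name: prod(subfiles[name]["fractions"]))


# Hypothetical subfiles: blocking factor (tuples per page) and the fractions
# selected by each unresolved selection attribute stored in the subfile.
subfiles = {
    "A": {"blocking_factor": 50, "fractions": [0.5]},
    "B": {"blocking_factor": 5, "fractions": [0.1, 0.2]},
}
by_pages = order_by_blocking_factor(subfiles)
by_selectivity = order_by_joint_selectivity(subfiles)
```

Here subfile A is cheap to link but barely reduces the TID list, while subfile B is costly to link but reduces the list to 2% of its size, so the two criteria produce opposite sequences. Because attribute values are assumed independent, the joint fraction of each subfile does not change as other subfiles are linked, which is why a single sort suffices for the second criterion in this sketch.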


Based upon the above criteria, we have developed five query evaluation strategies (heuristics) for choosing the subfile sequence. Each strategy is based upon one of the above criteria or uses a function of both criteria to rank the subfiles in some sequential order. Needless to say, we do not expect that any single strategy would be able to find the optimal sequence for all queries made to a database which is partitioned in any manner. However, we require that the sequence chosen by a good strategy never be far from the optimal sequence. In order to compare the different strategies which we present, we have conducted a set of experiments on each of the strategies. In order to determine to what degree the determined strategies are optimal and to what extent they may serve the purpose of query evaluation, we have also applied the set of experiments to two other "control" strategies, and compared the results with the results of the five strategies. The five strategies considered are:


(a) Least Page Access (LPA) - In this strategy, the subfile that results in the least number of page accesses is linked. That is, when there are a number of subfiles containing unresolved attributes, the query evaluator chooses to link to the subfile that would result in the least number of page accesses. This is in accordance with the first of the two ordering criteria discussed above. Intuitively, linking the first few subfiles will result in not too many page accesses, and as the subfiles that incur many page accesses are linked later on, the joint selectivity of the attributes resolved so far will be sufficiently high that not too many page accesses will be made to resolve the remaining attributes. As mentioned above, this strategy amounts to sequencing the subfiles according to decreasing blocking factor.

(b) Least Page Access by Pairs (LPAP) - In this strategy, the query evaluator looks at all ordered pairs of subfiles. For each pair, the query evaluator computes the cost of linking to the first subfile of the pair and adds to it the cost of subsequently linking to the second subfile. The computed costs for all the pairs are compared and the query evaluator selects the pair with the least cost to be the next two subfiles that are linked in the method. Note that when the query evaluator computes the cost of each ordered pair of subfiles, the second subfile will be linked from a subset of the TID list from which the first subfile is linked. This is because after linking to the first subfile, the TIDs of subtuples that did not satisfy the selection attributes in the first subfile are pruned from the TID list. Thereafter, the query evaluator reapplies the LPAP strategy to the remaining subfiles to select the next two subfiles that are to be linked in the method. The reapplication is repeated until all the subfiles have been sequenced. Every time a pair of subfiles is selected, the TID list is reduced to reflect the set of subtuples that in addition satisfy the selection attributes of the pair of subfiles selected.

The LPAP strategy is similar to the LPA strategy in that the criterion for sequencing is the number of page accesses. However, this strategy looks at two subfiles at a time and also considers the joint selectivity of the unresolved selection attributes of the first subfile in choosing the subfile pair. Therefore, this strategy will in general result in better methods compared to the methods chosen by the LPA strategy. Observe that if there are only two subfiles that have to be sequenced at the second stage of query evaluation, then this strategy will find the optimal sequence.

(c) Highest Subfile Selectivity (HSS) - In this strategy, subfiles are sequenced according to their resolving power: the subfile containing unresolved selection attributes with the highest joint selectivity is chosen to be linked first, the subfile with the second highest joint selectivity is linked second, etc. This is in accordance with the second ordering criterion discussed above. The idea here is to reduce the size of the TID list as fast as possible.

(d) Highest Selectivity and Least Pages (HSLP) - It is desirable to order the subfiles both according to the joint selectivity of the selection attributes and according to the number of pages accessed when linking to them. The previous strategies chose one or the other as the ordering criterion. This strategy combines the two criteria by ordering the subfiles according to the (increasing) product of the joint selectivity of the selection attributes and the number of page accesses incurred in linking to the subfile; the subfile with the least product is selected and the strategy is reapplied to the remaining subfiles. Every time the strategy is applied to the subfiles, the subfile with the least product is selected and the number of subfiles that are to be sequenced is reduced by one. This strategy is based upon the assumption that considering both criteria will result in a superior method compared to a method that is found using a single criterion. Note that in this strategy, every time a subfile is chosen, the TID list is reduced to reflect the resolution of the attributes in the newly chosen subfile (i.e. the joint selectivity of attributes resolved so far is multiplied by the joint selectivity of selection attributes in the chosen subfile). Thereafter, when choosing among the remaining subfiles, the number of page accesses incurred in linking to a subfile is computed from this reduced TID list.

(e) Highest Selectivity and Least Pages by Pairs (HSLPP) - This strategy is like the Highest Selectivity and Least Pages strategy except that all ordered pairs of subfiles are considered together. For each pair, the number of page accesses (computed in the same way as in the LPAP strategy) is multiplied by the joint resolving power of all the selection attributes in the pair of subfiles. The pair with the smallest product is chosen. The strategy is then applied to the remaining subfiles. Compared to the HSLP strategy, this strategy performs a search of depth two and hence will result in methods superior to those found by the HSLP strategy.
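The single-subfile strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each subfile is represented by a hypothetical pair (blocking factor, joint selectivity of its unresolved selection attributes), a smaller selectivity fraction means a more selective subfile, and `pages_touched` is a stand-in estimate of the pages read when fetching randomly placed subtuples. It is not the page access function of Chapter 4.

```python
from itertools import permutations

def pages_touched(total_pages, tids):
    """Expected pages read to fetch `tids` randomly placed subtuples.

    Stand-in estimate chosen for this sketch; NOT the report's expression.
    """
    if total_pages <= 0:
        return 0.0
    return total_pages * (1.0 - (1.0 - 1.0 / total_pages) ** tids)

def sequence_cost(order, n_tuples, start_fraction=1.0):
    """Total page accesses when linking to subfiles in the given order."""
    fraction, cost = start_fraction, 0.0
    for blocking, selectivity in order:
        cost += pages_touched(n_tuples / blocking, fraction * n_tuples)
        fraction *= selectivity  # the TID list shrinks after each link
    return cost

def lpa(subfiles):
    """Least Page Access: decreasing blocking factor."""
    return sorted(subfiles, key=lambda s: -s[0])

def hss(subfiles):
    """Highest Subfile Selectivity: most selective (smallest fraction) first."""
    return sorted(subfiles, key=lambda s: s[1])

def hslp(subfiles, n_tuples, start_fraction=1.0):
    """Greedy choice by least (selectivity x page accesses) product."""
    remaining, order, fraction = list(subfiles), [], start_fraction
    while remaining:
        best = min(
            remaining,
            key=lambda s: s[1] * pages_touched(n_tuples / s[0], fraction * n_tuples),
        )
        remaining.remove(best)
        order.append(best)
        fraction *= best[1]  # recompute later costs from the reduced TID list
    return order

def exhaust(subfiles, n_tuples, start_fraction=1.0):
    """Control strategy: enumerate every sequence and keep the cheapest."""
    return min(
        permutations(subfiles),
        key=lambda order: sequence_cost(order, n_tuples, start_fraction),
    )
```

Calling `sequence_cost(lpa(subfiles), n)` and `sequence_cost(exhaust(subfiles, n), n)` on the same input gives the kind of ratio the experiments report, although the numbers depend entirely on the stand-in cost function.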


We have conducted a number of experiments on the above five subfile sequencing strategies. The experiments varied over two different sets of query usage patterns, two partitions, three sets of attribute lengths, and three sets of attribute selectivities. The results given in the table below are the average for each strategy's performance. The two strategies Exhaust and Random are "control" strategies against which the other strategies are to be compared. The Exhaust strategy finds the optimal sequence of subfiles by exhaustively enumerating all sequences, and selecting the sequence which results in the least processing cost for the query. The Random strategy finds a sequence for the subfiles by randomly choosing one of the possible subfile sequences, in what amounts to a no-strategy. The first row of Table 1 is the ratio of the average page accesses for each strategy with respect to the page accesses of the Exhaust strategy. The second row is the ratio with respect to the Random strategy (for the same set of experiments).

The performance of the Least Page Access by Pairs strategy was very close to the optimal performance. By the performance of a strategy we mean the cost of answering the queries in the usage pattern when each query is evaluated according to the strategy. The Least Page Access strategy also compares favorably to the other strategies. The performance of the strategies that considered the joint selectivity was not as good as that of the LPAP strategy. Even the LPA strategy, which only considers the number of page accesses, performed better than the HSLP strategy that considered both the page accesses and the selectivity. We attribute this partly to the fact that after the first subfile has been linked, the joint selectivity of the resolved attributes has become high enough so that the second subfile incurs comparatively fewer page accesses than the first subfile. Thus it becomes important that the first subfile incur as few page accesses as possible.

                       Exhaust  Random  LPA    LPAP   HSS    HSLP   HSLPP
Relative to Exhaust    1.0      1.425   1.103  1.004  1.288  1.246  1.055
Relative to Random     0.701    1.0     0.773  0.704  0.903  0.875  0.740

Table 1. The results of different query evaluation strategies.
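The two rows of Table 1 are related by a single division: each entry of the second row should equal the corresponding first-row entry divided by the Random strategy's ratio. The short check below illustrates this; the assignment of values to strategy names assumes the column order Exhaust, Random, LPA, LPAP, HSS, HSLP, HSLPP, which is inferred from the surrounding discussion.

```python
# Ratios of average page accesses relative to the Exhaust strategy (Table 1, row 1).
# The strategy-to-column assignment is an assumption inferred from the text.
vs_exhaust = {
    "Exhaust": 1.0, "Random": 1.425, "LPA": 1.103, "LPAP": 1.004,
    "HSS": 1.288, "HSLP": 1.246, "HSLPP": 1.055,
}
# Ratios relative to the Random strategy, as printed (Table 1, row 2).
vs_random_printed = {
    "Exhaust": 0.701, "Random": 1.0, "LPA": 0.773, "LPAP": 0.704,
    "HSS": 0.903, "HSLP": 0.875, "HSLPP": 0.740,
}

# Recompute row 2 from row 1 and report the largest disagreement.
vs_random_computed = {
    name: ratio / vs_exhaust["Random"] for name, ratio in vs_exhaust.items()
}
max_gap = max(
    abs(vs_random_computed[name] - vs_random_printed[name])
    for name in vs_exhaust
)
print(f"largest difference between printed and recomputed row: {max_gap:.4f}")
```

Under that column assignment the two rows agree to within the rounding of the printed figures, and LPAP is within half a percent of the exhaustive optimum.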

The LPAP strategy is very close to optimal and may be considered as the choice for a query evaluator in a partitioned database environment. However, in our work, we have chosen the LPA strategy for the following reasons: 1. The LPA strategy is [illegible]. 2. The LPA strategy is computationally efficient compared to all the other strategies. Since the number of pages accessed in linking from a TID list to a subfile is a decreasing function of the subfile's blocking factor (see expression 4.2.1), a query evaluator based on the LPA strategy initially has to order all the subfiles of a partition according to decreasing blocking factor. For each partition, the subfiles of the partition need to be ordered only once. Thereafter, when evaluating a query, the query evaluator sequences the subfiles that contain unresolved selection attributes in accordance with the precomputed sequence based upon the subfile blocking factor.

The figures of Table 1 are performance averages over different queries, partitions, attribute lengths, and attribute selectivities. Obviously, some strategies perform better than others for certain queries and partitions. It was observed that in general, as the number of attributes in the selection components of queries increases, the performance of each strategy deteriorates with respect to the Exhaust strategy, with the strategies that consider only a single subfile (the LPA, HSS, and HSLP strategies) deteriorating the most. Also, it was observed that the larger the number of subfiles in the partition, the less optimal the performance of the various strategies.


The above discussion concerned conjunctive queries. For disjunctive queries, the query evaluation strategy used is very similar to the strategy used for conjunctive queries: If the indexed selection attributes are highly selective, and if the subfile containing the indexed selection attributes also contains unindexed selection attributes, then with great likelihood this subfile will be searched in its entirety, and using indices will not be very effective and may be avoided. Otherwise, the full set of applicable indices is used. The subfiles containing unresolved selection attributes are then sequenced according to the LPA strategy (i.e. according to decreasing blocking factor). For a disjunctive query, the joint selectivity of the resolved attributes is computed according to expression 3.6.2 from the individual attribute selectivities.

A disjunctive query is resolved differently from a conjunctive query in the query resolution phase: When a TID list is obtained by linking to a subfile, the union of the new TID list is taken with the old TID list. The resulting TID list is then complemented to obtain a list of subtuple TIDs that do not satisfy any of the attributes resolved so far. This complemented TID list is used when linking to the next subfile in the method. Complementing a TID list is accomplished by repeatedly generating subfile TIDs using expression 3.2.2 and keeping each generated TID that does not appear in the list.
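The union-then-complement loop used for disjunctive queries can be sketched with Python sets. This is a minimal illustration, assuming TIDs are plain integers and each link step is modelled as a hypothetical predicate on a TID; the real system links to subfile pages and generates TIDs with expression 3.2.2.

```python
def resolve_disjunction(all_tids, link_steps):
    """Return the TIDs satisfying at least one condition of a disjunction.

    all_tids   -- every TID of the file (hypothetical integer TIDs)
    link_steps -- one predicate per subfile in the method, in sequence order
    """
    universe = set(all_tids)
    satisfied = set()          # the "old" TID list
    candidates = set(universe)  # TIDs not yet known to satisfy anything
    for matches in link_steps:
        # Link only to subtuples that have not satisfied an earlier condition.
        new_tids = {tid for tid in candidates if matches(tid)}
        satisfied |= new_tids                # union of new and old TID lists
        candidates = universe - satisfied    # complement for the next link
    return satisfied
```

Because each step inspects only the complemented list, a subtuple already known to satisfy the disjunction is never retrieved again.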


After the query evaluation phase, the query's method is passed to the query resolver, which actually produces the TID list of selected subtuples. The TID list is then used for linking to subfiles containing projection attributes. Depending on the transaction type, the query answerer does the following:

Query - The subfiles containing the projection attributes are linked from the TID list constructed at the query resolution phase. The selected subtuples are retrieved from the subfiles, and the values of projection attributes are extracted and returned.

Update - The selected subtuples are retrieved from subfiles containing projection attributes (as for a query), all attribute values to be updated are updated (in primary memory), and the subtuples are written back in their previous location. An update incurs as many page accesses as a query in the resolution phase, and twice the number of page accesses in the answering phase. If any of the updated attributes are indexed, then the affected indices are maintained as appropriate.

Deletion - All co-subtuples of the selected tuples are retrieved, marked deleted, and written back in their previous locations. A deletion incurs the same cost as a query in the resolution phase, and twice the cost of retrieving all co-subtuples (i.e. the entire tuple) of selected tuples in the answering phase. The affected indices are maintained as appropriate. An overflow garbage collection may ensue if there are too many deleted tuples in the file.

Insertion - An insertion is different from the other transactions. Assuming that the unused tuples are uniformly scattered throughout the file, inserting r tuples in the file incurs twice the number of page accesses required for retrieving r uniformly distributed subtuples from each of the subfiles. This number is computed from the page access function of Chapter 4. If the unused tuple slots in the file have been exhausted, then the excess inserted tuples are appended to the end of the file. In this case, the number of page accesses incurred for each subfile will be the number of appended subtuples divided by the blocking factor of the subfile. Indices are maintained as appropriate.

We note here that the optimal attribute partition is independent of index maintenance, as the cost of maintaining the indices is incurred regardless of the choice of partition. Also, we have taken the two problems of index selection and attribute partitioning as separate, assuming that the set of indexed attributes is fixed. Therefore, index maintenance cost will not affect our objective cost function, and we may eliminate it from further consideration.


6. Parameter Acquisition 


The Parameter Acquisitor monitors the database management system and collects statistics both on the usage pattern and on the response of the database management system to the queries. The statistics collected are used to forecast database and usage pattern parameters for the next time interval. A time interval is the time span between two consecutive repartitioning points. The forecasted parameters will be used by the file cost estimator and the attribute partitioning heuristics at the repartitioning point marking the end of the time interval. Monitoring the database management system is a real time activity; it has to be performed while the database management system processes transactions. For this reason, only those statistics that can be inexpensively acquired should be collected. Also, the statistics collected must be succinct and require little storage for their preservation. The statistics collected for the purpose of attribute partitioning fall into four general classes:

Database Usage Statistics - For each query made to the database, the type of the query is stored in a table of query types. The type of a query is the set of attributes in the selection component, the set of attributes in the projection component, and a flag indicating whether the query is conjunctive or disjunctive. Consequently, all queries with the same attributes (but with possibly distinct attribute values in the equality condition predicate) are clustered together in the same entry of the table. (A query type may be encoded as a bit map for the sake of succinctness.) Our assumption that the fraction of tuples satisfying an equality condition predicate depends only on the selection attributes and not on the attribute values in the selection component makes this clustering scheme possible. The number of queries that are clustered in the query type is recorded along with the query type in the table entry.
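A table of query types of this kind might be kept as follows. This is a sketch only: the attribute names, the tuple layout of a query type, and the helper names are all hypothetical, and the bit map simply assigns one bit per attribute of the relation.

```python
from collections import Counter

# Hypothetical relation schema; bit i of a mask stands for ATTRIBUTES[i].
ATTRIBUTES = ["name", "dept", "salary", "age"]

def attribute_mask(attrs):
    """Encode a set of attribute names as a bit map."""
    mask = 0
    for attr in attrs:
        mask |= 1 << ATTRIBUTES.index(attr)
    return mask

def query_type(selection, projection, conjunctive):
    """A query type: selection bit map, projection bit map, conjunctive flag."""
    return (attribute_mask(selection), attribute_mask(projection), conjunctive)

# One table entry per query type, holding the number of queries clustered in it.
usage = Counter()

def record(selection, projection, conjunctive=True):
    """Count one more query of the given type, ignoring attribute values."""
    usage[query_type(selection, projection, conjunctive)] += 1
```

Two queries that select on `dept` and `age` with different constants fall into the same entry, which is what keeps the table small.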


Average Relation Size and Average Blocking Factor - The number of tuples in each file is needed for the purpose of cost analysis. This statistic is continuously updated by the number of tuples inserted or deleted so that it reflects the instantaneous size of the file. The blocking factor of the file (the number of tuples per page) is also required for cost analysis. The blocking factor at a certain point in the time interval is the number of tuples in the file divided by the number of pages in the file at that point in the time interval. The number of pages in the file is also updated continuously as pages are allocated for new tuples or as pages are released after garbage collection, so that it reflects the true state of the database. Since tuples will be continuously inserted and deleted, while some tuples will be temporarily unused (until a tuple is inserted in place of a deleted tuple or until the next overflow garbage collection occurs), a fixed value for the blocking factor over the time interval will at best reflect an average of the true blocking factor. The average blocking factor parameter is obtained by averaging the blocking factors observed at a number of points in the time interval.


Attribute Selectivity Statistics - This statistic is the fraction of tuples that have historically satisfied an equality condition predicate on the attribute. To compute the selectivity of an attribute, the parameter acquisitor records the number of times the attribute occurs in equality condition predicates of queries, and for each such query, the parameter acquisitor records the fraction (or an approximation thereof) of tuples that satisfied the equality condition. The average of these fractions is thus the attribute selectivity measure. Below, we describe how the fraction of selected tuples is determined.

Let $\sigma_{ij}$ be the fraction of tuples that satisfy an equality condition predicate involving the ith attribute and occurring in the jth query. The attribute will be resolved by either sequential searching, indexing, or linking. If the attribute is resolved by sequential searching (i.e. the subfile containing the attribute is searched in its entirety), then $\sigma_{ij}$ can be precisely calculated as the ratio of the number of tuples satisfying the equality condition predicate to the total number of tuples n. If the attribute is resolved by indexing, then a TID list will be obtained that points to the selected tuples, and $\sigma_{ij}$ is precisely the ratio of the size of the TID list to n. If the attribute is resolved by linking, then $\sigma_{ij}$ has to be calculated in a reduced tuple space and then extrapolated to the entire space. This is because linking is performed on a subset of the tuples that have been identified beforehand, in order to find those that additionally satisfy the predicate. Depending on whether query j is conjunctive or disjunctive, the estimation of $\sigma_{ij}$ will be done as follows.


(a) Suppose the equality condition appears in a conjunction of L equality conditions of the form:

    $$C_1 \wedge C_2 \wedge \cdots \wedge C_L$$

where $C_i$ is an equality condition involving attribute $a_i$. (The order of the equality conditions above reflects the order the predicates are sequenced in the query's method.) Let $n_0$ be the total number of tuples in the relation, and let $n_i$ be the number of tuples that satisfy $C_1 \wedge C_2 \wedge \cdots \wedge C_i$. (Note that all these numbers are readily available from the query processor when it resolves the transaction.) The fraction of tuples satisfying $C_i$ of query j can then be approximated as:

    $$\sigma_{ij} \approx \frac{n_i}{n_{i-1}}$$


(b) Suppose the equality condition appears in a disjunction of L equality conditions of the form:

    $$C_1 \vee C_2 \vee \cdots \vee C_L$$

where $C_i$ is an equality condition involving attribute $a_i$. (The order of the equality conditions above reflects the order the predicates are sequenced in the query's method.) Let $n_0$ be the total number of tuples in the relation, and let $n_i$ be the number of tuples that satisfy $\neg C_1 \wedge \neg C_2 \wedge \cdots \wedge \neg C_i$. (Again, these numbers are readily available from the query processor.) The fraction of tuples satisfying $C_i$ of query j can then be approximated as:

    $$\sigma_{ij} \approx 1 - \frac{n_i}{n_{i-1}}$$
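The extrapolation from the reduced tuple space can be illustrated numerically. The sketch below assumes the counts are taken in method order and that each fraction is estimated from two consecutive counts: for a conjunction, `counts[i]` is the number of tuples satisfying the first i conditions; for a disjunction, it is the number satisfying none of the first i conditions. Both readings, and the function names, are assumptions made for illustration.

```python
def conjunctive_fractions(counts):
    """Estimate the fraction satisfying each C_i of a conjunction.

    counts[0] is the relation size; counts[i] is the number of tuples
    satisfying C_1 and ... and C_i. Returns None where no tuples remained
    to test, since nothing can be inferred from an empty TID list.
    """
    return [
        counts[i] / counts[i - 1] if counts[i - 1] else None
        for i in range(1, len(counts))
    ]

def disjunctive_fractions(counts):
    """Estimate the fraction satisfying each C_i of a disjunction.

    counts[0] is the relation size; counts[i] is the number of tuples
    satisfying none of C_1 ... C_i (the complemented TID list size).
    """
    return [
        1 - counts[i] / counts[i - 1] if counts[i - 1] else None
        for i in range(1, len(counts))
    ]
```

For example, a relation of 1000 tuples of which 100 pass the first condition and 20 pass both yields estimated fractions of 0.1 and 0.2, even though the second condition was only ever tested against 100 tuples.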


The attribute selectivity $s_i$ for attribute $a_i$ may now be computed as the average of $\sigma_{ij}$ for all $j \in Q$:

    $$s_i = \frac{1}{|Q|} \sum_{j \in Q} \sigma_{ij}$$

where Q is the set of queries made to the database during the time interval. By averaging the fraction of tuples satisfying the actual occurrences of an attribute in the queries, we have taken into consideration both skewness in the distribution of attribute values (for the attribute) in the file as well as skewness in the distribution of attribute value occurrences in queries.

The selectivity of an attribute should change if either the distribution of its values in the file changes, or if the values in the equality condition predicates of queries involving the attribute change. Since the above changes occur when tuples are inserted, deleted, or updated and also as the database usage pattern evolves, the attribute selectivity measures need to be continuously updated to reflect the recent and more accurate values.

The attribute selectivity measure is kept up to date by keeping a running average of each attribute selectivity as the fraction of tuples satisfying an equality condition predicate on the attribute is calculated in the process of query resolution. Every time a search is done on an attribute in a file (or subfile), the attribute's selectivity is updated by the weighted average of the old selectivity and the fraction of the tuples selected in the search.


After the individual attribute selectivities have been obtained, the joint conjunctive or disjunctive attribute selectivities may be computed from them. Our assumption of independence of attribute value occurrences in tuples leads us to simple formulas for the joint selectivities. The expected fraction of tuples that satisfy a conjunction of equality conditions simultaneously is equal to the product of the individual expected fractions that satisfy each equality condition. The joint conjunctive selectivity of a set of attributes I, each with selectivity $s_i$, is:

    (3.6.1)    $$\prod_{i \in I} s_i$$

Similarly, the expected fraction of tuples that satisfy a disjunction of equality conditions is the complement of the fraction expected not to satisfy any of the equality conditions in the disjunction. The joint disjunctive selectivity for a set of attributes I, each with selectivity $s_i$, is:

    (3.6.2)    $$1 - \prod_{i \in I} (1 - s_i)$$
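Expressions 3.6.1 and 3.6.2 translate directly into code. The following sketch assumes selectivities are given as fractions between 0 and 1 and that attribute values occur independently, as stated above; the function names are illustrative.

```python
from math import prod

def joint_conjunctive(selectivities):
    """Expression 3.6.1: fraction expected to satisfy every condition."""
    return prod(selectivities)

def joint_disjunctive(selectivities):
    """Expression 3.6.2: complement of the fraction satisfying no condition."""
    return 1 - prod(1 - s for s in selectivities)
```

With selectivities 0.5 and 0.1, the conjunction is expected to keep 5% of the tuples and the disjunction 55%.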


The last statistical information needed is the performance cost of the partitioned database in the current time interval. This is the cost (in terms of the number of page accesses) incurred when the database management system answers all the queries in the usage pattern. This statistic is the sum total of the number of page accesses made in answering queries since the last repartitioning point. The parameter acquisitor updates this figure every time a query is made to the database. This statistic is used to determine the extent to which the partitioned database performance cost comes close to the performance cost that had been estimated at the previous repartitioning point. If the partitioned database performance cost is not within a reasonable distance from the performance cost that was forecasted for it, then it may be concluded that the current usage pattern no longer reflects the forecasted usage pattern and that the current attribute partition is no longer suitable. If the database performance is diagnosed as such, then a repartitioning of the database may be initiated.


When repartitioning is initiated at a repartitioning point, the parameter acquisitor takes the statistics collected in the time interval since the last repartitioning point (and also the statistics collected during previous time intervals) and forecasts parameters for the time interval up to the next repartitioning point. Specifically, the parameter acquisitor forecasts the following parameters.

(a) The frequency of occurrence of each query type.

(b) The size of the relation and the average blocking factor.

(c) The average selectivity of each attribute.

A thorough discussion of the exponential smoothing forecasting technique that should be used for the purpose of predicting the above set of parameters appears in [15] and [7]; we will only give an outline here. Intuitively, exponential smoothing uses a weighted moving average that is based on two sources of evidence: the most recent observation and the forecast made previously. The new forecast is equal to a percentage (known as the smoothing constant) of the recent observation plus the complement percentage of the previous forecast. Exponential smoothing has a number of advantages including simplicity of computation, minimal storage requirements, adjustability for responsiveness, and generalizability to account for trends. A variant of exponential smoothing known as adaptive forecasting may also be used. This technique takes trends in the parameters into account. It is an efficient technique and more reliable than exponential smoothing, and may be preferred to simple exponential smoothing in some cases. For a thorough discussion of the different forecasting techniques, the reader is referred to the two works cited above.
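Simple exponential smoothing as outlined above needs only the previous forecast and the newest observation. A minimal sketch follows; the parameter names are illustrative, and the smoothing constant `alpha` is a tuning choice not specified here.

```python
def exponential_smoothing(observations, alpha, initial_forecast):
    """Return the forecast after folding in each observation in turn.

    alpha is the smoothing constant (0 < alpha <= 1): the share of the new
    forecast taken from the most recent observation. The remaining share
    (1 - alpha) comes from the previous forecast.
    """
    forecast = initial_forecast
    for observed in observations:
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast
```

A larger `alpha` makes the forecast respond faster to a changing usage pattern, at the price of reacting more strongly to noise in a single interval.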


7. Repartitioning Points 


Database repartitioning points may be determined in several ways. Repartitioning may either be initiated by the database administrator whenever the database administrator deems necessary, or may be initiated by the parameter acquisitor. One way to have the parameter acquisitor itself initiate repartitioning is to require it to prepare at each repartitioning point a forecast of the usage pattern and the database parameters for a number of periodic checkpoints into the future. For each checkpoint, the performance cost of the partitioned database is forecasted. During the course of monitoring the database management system and the performance of the partitioned database, whenever a checkpoint is reached, the parameter acquisitor compares the observed performance cost with the performance cost forecasted at the previous repartitioning point for that checkpoint. If the observed performance is inferior to the forecasted performance by a margin that is not acceptable, the parameter acquisitor may conclude that the current partition is no longer suitable for the current usage pattern and should then initiate repartitioning. When repartitioning is initiated, forecasts of the usage pattern and database parameters are prepared for a number of periodic checkpoints into the future. (Finding the optimal set of checkpoints is itself another database optimization problem. We refer the reader to a brief discussion of the problem of optimal determination of repartitioning points presented in Section 7.1.) The attribute partitioning heuristics are then invoked to find a suitable partition that is optimal or near-optimal for the forecasted usage pattern. If the proposed partition is different from the current partition, the current attribute partition is also cost estimated for the forecasted usage pattern. If the cost of the proposed partition is less than the cost of the current partition by a margin that justifies repartitioning, the repartitioning of the database is carried out.
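The two comparisons made at a checkpoint can be written down as simple predicates. This is a sketch under assumptions: costs are page-access counts, the "margin that is not acceptable" is modelled as a relative tolerance, and the "margin that justifies repartitioning" is modelled as a hypothetical one-time reorganization cost. Neither threshold is prescribed by the text.

```python
def should_initiate(observed_cost, forecast_cost, tolerance=0.2):
    """True if observed cost exceeds the checkpoint forecast by too much.

    tolerance is an assumed relative margin (0.2 means 20% worse).
    """
    return observed_cost > forecast_cost * (1 + tolerance)

def should_repartition(current_cost, proposed_cost, reorganization_cost):
    """True if the forecasted saving outweighs the cost of reorganizing.

    Both partition costs are estimated for the same forecasted usage pattern.
    """
    return current_cost - proposed_cost > reorganization_cost
```

Keeping the two tests separate mirrors the procedure: the first only triggers a search for a better partition, and the second decides whether the partition found is worth the reorganization.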
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ap 


ak 


In this chapter we will analyze the cost of resolving and answering a query ‘made to 
| the database and describe how the file cost estimator’ ‘derives the (ota system performance cost 
“for a given. partition and a set of queries. (Henceforth, we will use the term partition 
performance cost to mean the performance “cost of the “database management system 
partitioned in a specified ‘way, in response to the queries in the usage pattern. Also, we will 
use the term evaluating a parehion to mean the derivation of thé partition’s performance cest 
by the file cost estimator) Each of the attribute partitioning heuristics repeatedly calls upon 
the fite cost ‘estimator to evaluate partitions they propose. Thereafter, they select the partition 
with the best evaluation and based on it propose another set of partitions. Each proposed 
partition is cost evaliiated by the file cost estimator andthe parton with the best evaluation 
repeated. By this process, the heuristics try to propose partitions that result in successively 
better evaluations. So in a sense, the file cost estimator may ‘be viewed as our “objective cost 
function, which the heuristics proceed to minimize by proposing ‘petier and better partitions. 
7 We will assume throughout that ternal ‘processing costs (CPU: costs) are 
insignificant and the performance of the database maior Syatem we model is bounded 
by input/output operations (page read and writes) “and hence: that. page accessing ‘cost 


dominates all internal processing costs. Internal processing costs include the costs of query 
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evaluation (assumed to be negligible) and of obtaining. intersections and unions of TID lists 


“in query resolugion. We ayume that. forming. the nuars 
entirely done in primary memory and-so does net.incur any page access to storage devices. 
Cece the index of an-attribute and retrieving the TID: list of: the: index, do: incur page 
accesses. Therefore, we arias these poste, in computing: the partir’ 's performance cost. 

evaluation. This is because 


‘We do ibaa data storage costs ih the arti 


if page breakage at the end of a subfile is. ignored, the amount: ‘of storage required by. every 
attribute partition of a file is the same. No matter how the file Is partitioned, the storage, area 
required for that me wi be ue ‘same as that for the one-fite 1 ther pies, an. insignificant 


number of pages due to page breakage. When repartitioning. an.agel thute partition, for gach 


$, soe 


subfile at most one page can remain unfitied; _ the: change in stopage requirement from one 


partition to another carinot exceed the maximem number of subfites in the. two partitions. 
Since this figure is usually insignificant compared to the total number, of pages required Jo 
store the data, we “may safely ignore page. Pena and . hence. storage | costs ‘from 


consideration ‘in the evaluation of a partition. 


Based upon the above assumptions, the performance cost of a partition will be the
cost of accessing the subfiles in order to answer the queries in the usage pattern. Since page
access cost is proportional to the number of page accesses, our cost analysis will solely be
concerned with the number of page accesses incurred in answering a query. Before we
discuss the file cost estimator, we give the page access analysis for each of the sequential
search, linking, and indexing access paths.


1. Sequential Search

If a query's method does not specify any index to be searched (this happens if none
of the query's selection attributes are indexed, or if the query evaluator deems the indices
useless for resolving the query), then the first subfile of the method has to be sequentially
searched in its entirety (the rest of the subfiles will be searched using links). In a sequential
search, all the pages of the subfile are retrieved, their subtuples are matched against the
attribute values specified in the query selection predicates, and the TIDs of qualifying
subtuples are stored in a TID list. If $F_i$ is the subfile that is being sequentially searched, n
the number of tuples in the subfile, and $b_i$ the blocking factor for $F_i$, then the number of
page accesses will be equal to the number of pages in the subfile:

$$\left\lceil \frac{n}{b_i} \right\rceil$$

The blocking factor $b_i$ is equal to the system page size S divided by the length of the
subtuple. If $A_i$ is the set of attributes in subfile $F_i$, and $l_a$ the length of attribute a, then:

$$b_i = \left\lfloor \frac{S}{\sum_{a \in A_i} l_a} \right\rfloor$$
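The two quantities above can be computed directly. The following is a minimal sketch, assuming that the blocking factor is rounded down to a whole number of subtuples per page and that the page count is rounded up; the function names and the sample page size and attribute lengths are illustrative and do not come from the report.

```python
def blocking_factor(page_size, attribute_lengths):
    """Subtuples per page: page size divided by subtuple length, rounded down."""
    return page_size // sum(attribute_lengths)


def sequential_search_pages(n_tuples, b):
    """Pages read when a subfile of n_tuples subtuples is scanned in its entirety."""
    return -(-n_tuples // b)  # integer ceiling of n_tuples / b


# Hypothetical subfile: three attributes of 4, 20 and 8 bytes on 4096-byte pages.
b = blocking_factor(4096, [4, 20, 8])
pages = sequential_search_pages(10_000, b)
```

With these sample values the subtuple is 32 bytes long, so 128 subtuples fit on a page and a full scan of 10,000 subtuples reads 79 pages.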


2. Tuple Retrieval Using Links

Assume that we have a list of TIDs pointing to the subtuples of a subfile. We want
to compute the number of page accesses incurred in retrieving the subtuples with TIDs in the
list. (Such a TID list might have been obtained either by a sequential search on a subfile, by
following a previous link to another subfile, by indexing on an attribute, or by forming the
intersection or union of TID lists obtained in any of the previous ways.) In any case, an
estimate of the number of TIDs in the list (which is equal to the number of tuples to which
they point) is readily available from the joint selectivity of the attributes that have been
resolved so far, whose resolution has resulted in this TID list. If s is the joint (conjunctive
or disjunctive, depending on whether the query is conjunctive or disjunctive) selectivity of all
attributes that have been resolved in the process of obtaining the TID list, and n is the number
of tuples in the subfile, then the length of the TID list is approximated by s·n.


Our cost criterion for performance optimization is the number of page accesses in a
paged memory environment in which tuples are blocked together in pages. We have to
translate the expected number of tuple accesses to the expected number of page accesses. The
expected number of pages to be accessed is always less than or equal to the number of tuples
to be accessed because two or more tuples may reside on the same page. In our model of the
database management system, finding the expected number of page accesses is relatively easy
because of the following properties, which hold as a result of the assumptions we have made
about our file and index models:


1- The TID list is ordered. Whether the TID list is obtained by sequential searching,
linking, or indexing, it is ordered (i.e., sorted in increasing or decreasing value), and
subsequent intersections and unions preserve this ordering. This property of the TID
list assures that each page of a subfile is retrieved at most once (since the tuples are
retrieved in the sequence they reside in the subfile). This property also eliminates the
need for large buffer areas in primary memory to accommodate input/output operations,
since at any instant at most one page will be in primary memory.

2- The TID list is not redundant; i.e., no TID appears more than once in the list.

3- The TIDs are distributed uniformly over the TID space. This property is assured by
our assumption in Section 3.4 of independence among the occurrences of attribute values
in tuples. Hence all TIDs appear with equal probability in the TID list, and the tuples
they point to are scattered uniformly throughout the subfile.


These three properties of our model of tuple access make the translation of the
number of tuple accesses to page accesses relatively simple. Based upon these three
properties, Yue and Wong [40] have derived the number of page accesses from the number
of tuple accesses in terms of the recurrence relation 4.2.1:

(4.2.1.a)   $A(n,b,0) = 0$

(4.2.1.b)   $A(n,b,r+1) = A(n,b,r) + \dfrac{n - b\,A(n,b,r)}{n - r}$

In the above formula, A(n,b,r) is the expected number of pages accessed from a file (subfile)
with n tuples (subtuples) and b tuples (subtuples) per page when accessing r tuples
(subtuples). (Note that r = s·n in the cost analysis above.) The computation of 4.2.1
involves on the order of r multiplications and r divisions, and is therefore quite expensive
to compute. By the technique of generating functions, we have solved the recurrence relation
4.2.1 and have obtained the closed form solution 4.2.2 for the number of page accesses.

(4.2.2)   $A(n,b,r) = \dfrac{n}{b}\left[\,1 - \dbinom{n-r}{b} \Big/ \dbinom{n}{b}\,\right]$


The above formulation has the advantage over the recurrence relation that it can be
computed more efficiently. The derivation of the formulation 4.2.2 (hereafter called the
page access function) may be found in the Appendix. (Waters [38] and Yao [37] have
independently arrived at the page access function using the hypergeometric distribution of
probability theory.) The formulation 4.2.2 also admits of a simple interpretation. For an
arbitrary page in the file, the probability that it does not contain any of the r selected tuples
is the number of ways of choosing b tuples from n-r tuples, divided by the number of
ways of choosing b tuples from n tuples. Hence the expected number of page accesses will
be the number of pages (n/b) times the complement of the above probability.
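The agreement between the recurrence and the closed form can be checked numerically. The sketch below assumes the recurrence takes the form A(n,b,r+1) = A(n,b,r) + (n - b·A(n,b,r))/(n - r), which is what the interpretation above implies (the next tuple falls on a page not yet touched with probability (n - b·A)/(n - r)); the function names are illustrative.

```python
from math import comb


def pages_recurrence(n, b, r):
    """Expected pages accessed, computed step by step (about r divisions)."""
    a = 0.0
    for k in range(r):
        # Untouched pages hold n - b*a of the n - k tuples not yet selected.
        a += (n - b * a) / (n - k)
    return a


def pages_closed_form(n, b, r):
    """Expected pages accessed: (n/b) * (1 - C(n-r, b) / C(n, b))."""
    return (n / b) * (1 - comb(n - r, b) / comb(n, b))


rec = pages_recurrence(1000, 10, 50)
closed = pages_closed_form(1000, 10, 50)
```

Both functions return the same value up to rounding error. As expected, fetching a single tuple costs exactly one page, and the result never exceeds either r or the number of pages n/b.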

During the course of attribute partitioning, the attribute partitioning heuristics
repeatedly call upon the file cost estimator to evaluate partitions. Every time the file cost
estimator evaluates a partition, it has to estimate the cost of answering each of the query types
in the table of query types. Estimating the cost of answering each query type involves
computing the number of page accesses incurred in accessing each of the subfiles that
contains an attribute in the selection or projection attributes of the query type. Since for
each such subfile we have to compute the page access function, it is important that the page
access function be computed as efficiently as possible. The page access function 4.2.2, if
expanded, will take on the order of min(b, r) multiplications per computation. Although the
page access function is much more efficient in computation than the recurrence relation 4.2.1
(since b is usually much smaller than r), computing the page access function in its exact
form 4.2.2 is still too costly for our purposes. Instead, we use the following approximation to
the page access function (developed with Michael Hammer) in our file cost estimations:

(4.2.3)   $A(n,b,r) \approx \dfrac{n}{b}\left[\,1 - \left(1 - \dfrac{r}{n - (b-1)/2}\right)^{b}\,\right]$


The approximation 4.2.3 has proven to be very fast, requiring only a constant number of
multiplications and divisions per computation, and has the advantage of extreme accuracy for
almost every combination of n, b, and r.
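A quick numerical comparison illustrates the accuracy claim. This sketch assumes the approximation has the form (n/b)·[1 - (1 - r/(n - (b-1)/2))^b], i.e., each of the b factors of the exact product is replaced by the factor at the midpoint; that form is a reconstruction, and the clamp for very large r is an added safeguard rather than part of the report.

```python
from math import comb


def pages_exact(n, b, r):
    """Exact page access function: (n/b) * (1 - C(n-r, b) / C(n, b))."""
    return (n / b) * (1 - comb(n - r, b) / comb(n, b))


def pages_approx(n, b, r):
    """Constant-time approximation; r may be a non-integer estimate s*n."""
    # Clamp at zero so that r close to n cannot produce a negative base.
    base = max(0.0, 1.0 - r / (n - (b - 1) / 2))
    return (n / b) * (1.0 - base ** b)


exact = pages_exact(10_000, 20, 500)
approx = pages_approx(10_000, 20, 500)
relative_error = abs(approx - exact) / exact
```

For this sample (10,000 tuples, 20 per page, 500 selected) the relative error is far below one tenth of one percent, while the approximation needs only one exponentiation instead of a product of b terms.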


Using the index of an attribute of a file (or subfile) in order to retrieve tuples that
have a given value for that attribute is composed of three steps. The first step is accessing
the non-leaf pages of the index to get a pointer to the TID list of tuples with the given
attribute value. The second step consists of retrieving this TID list. The third step is
retrieving the tuples that the TIDs point to by retrieving the pages in which they reside in the
subfile. A detailed analysis of indexing costs appears in Chan [7] and we will not reproduce it.
We shall only repeat here the final expression derived there. The average cost of using an index
is:

[expression illegible in the source copy; it is the sum of three terms, see Chan [7]]

where n = number of tuples in the file
      b = blocking factor of the indexed attribute's subfile
      L = length of the indexed attribute
      s = selectivity of the indexed attribute
      $u_n$ = average fraction of index node page utilization
      $u_l$ = average fraction of index leaf page utilization
      S = system page size.

The three terms of the expression are the respective costs of the three indexing steps. The
last step of index use, i.e., retrieving the qualifying tuples, actually occurs only if this attribute is
the only one whose index is used in the method of the query. Otherwise, the
intersection or the union of the TID list obtained in the second step of indexing is taken
with other TID lists before the tuples that the resulting TIDs point to are retrieved.
4. File Cost Estimator

The file cost estimator evaluates a partition proposed by the partitioning heuristics
and computes the performance cost for that partition. The performance cost of each
proposed partition is estimated by iterating over the queries in the table of query types and
estimating the cost of answering each query. (This table is provided by the parameter
acquisitor and is a forecast of the database usage pattern for the next time interval.) Each
query type in the table is passed for evaluation to the query evaluator. The query evaluator
uses the Least Page Access strategy and thereby produces a near-optimal method for the
query. (The Least Page Access strategy, as described in Section 3.5, sequences the subfiles that
are to be linked according to decreasing subfile blocking factor.)
The file cost estimator receives the method for the query. If any index is specified to
be accessed by the method, the file cost estimator uses the cost expression for index use to
estimate the cost of accessing the indices. If no index is specified by the query's method, the
first subfile in the method has to be sequentially searched and the cost of the search is the
number of pages in the subfile. (The reason that the first subfile must be sequentially
searched is that initially there is no TID list on hand that would restrict the search to certain
pages of the subfile. Sequential searching may be viewed as a limiting form of linking, where
each page of the subfile has to be retrieved.) In either case, i.e., if indices are used or the
subfile is sequentially searched, the joint selectivity of the attributes resolved so far can be
readily computed from expressions 3.6.1 or 3.6.2, depending on whether the query is conjunctive
or disjunctive. (The set of attributes I in 3.6.1 and 3.6.2 is the set of attributes resolved so
far.) The remaining subfiles of the method are sequenced and are to be linked in the
sequence specified by the method. Using the approximation to the page access function, the
file cost estimator computes the cost of accessing the first of these subfiles. (Observe that r,
the number of tuples to be retrieved, is the product of the joint selectivity of attributes
resolved so far and the number of tuples in the subfile n.) The access cost estimated for this
subfile is then added to the cost of indexing/sequential searching. The file cost estimator
then derives the new joint selectivity figure by including the old joint selectivity figure and
the selectivities of all the attributes resolved in this step of the method in expression 3.6.1/3.6.2.
The cost of linking to the remaining subfiles of the method in the sequence specified is then
computed for each successive subfile in the same way. The cost of accessing each subfile is
added to the accumulated cost, and the joint selectivity is updated according to the
selectivities of the newly resolved attributes. When no subfile remains in the method, the file
cost estimator will have computed the cost of resolving the query and also the joint selectivity
of the query (and hence the number of tuples selected by the query).

The file cost estimator then computes the cost of answering the query. Using the
approximation to the page access function and the estimate of the number of tuples that are
selected by the query, the file cost estimator estimates the number of pages that need to be
retrieved from each subfile that contains any of the projection attributes. The only subfile
containing a projection attribute that does not incur page retrievals in the answering phase is
the last subfile in the method of the query. This is because the projection attributes in this
subfile can be retrieved as the selection attributes of this subfile are being resolved. The cost
accumulated in the resolving and answering phases is then summed to give the cost estimate
for the query. A query's cost estimate is then multiplied by the frequency of the query type to
give the total cost estimate for that query type. Finally, the sum of these weighted query cost
estimates is the performance cost of the partition (in the context of the forecasted usage pattern).
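The resolving and answering phases described above can be sketched as a short accumulation loop. This is only an illustration under stated assumptions: index use is omitted (the first subfile is always scanned), the joint selectivity is taken to be a product of independent selectivities for conjunctive queries and the complement of the product of complements for disjunctive ones (expressions 3.6.1 and 3.6.2 are not reproduced here), and the page access approximation is the reconstructed form with midpoint (b-1)/2. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from math import ceil, prod


def pages_accessed(n, b, r):
    """Approximate pages touched when r of n tuples are fetched, b tuples per page."""
    base = max(0.0, 1.0 - r / (n - (b - 1) / 2))
    return (n / b) * (1.0 - base ** b)


def joint_selectivity(selectivities, conjunctive):
    """Assumed independence: product for AND, 1 - product of complements for OR."""
    if conjunctive:
        return prod(selectivities)
    return 1.0 - prod(1.0 - s for s in selectivities)


@dataclass
class Step:
    blocking_factor: int   # b of the subfile visited in this step of the method
    selectivities: list    # selectivities of the selection attributes resolved here


@dataclass
class QueryType:
    frequency: float
    steps: list                         # subfiles in the order the method visits them
    projection_blocking_factors: list   # subfiles read again in the answering phase
    conjunctive: bool = True


def query_cost(n, q):
    first, rest = q.steps[0], q.steps[1:]
    cost = ceil(n / first.blocking_factor)   # no index: scan the first subfile
    resolved = list(first.selectivities)
    for step in rest:                        # resolving phase: follow the links
        r = joint_selectivity(resolved, q.conjunctive) * n
        cost += pages_accessed(n, step.blocking_factor, r)
        resolved += step.selectivities
    r = joint_selectivity(resolved, q.conjunctive) * n
    for b in q.projection_blocking_factors:  # answering phase
        cost += pages_accessed(n, b, r)
    return cost


def partition_cost(n, query_types):
    """Frequency-weighted sum of the per-query cost estimates."""
    return sum(q.frequency * query_cost(n, q) for q in query_types)
```

For example, `query_cost(1000, QueryType(2.0, [Step(10, [0.1])], [10]))` charges 100 pages for the scan plus roughly 65 pages for fetching the 100 selected tuples from the projection subfile, and `partition_cost` doubles that figure because the query type has frequency 2.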

The file cost estimator is called repeatedly in the process of attribute partitioning. It
is imperative that the file cost estimator be implemented efficiently. Note that by clustering
all queries with the same type into one entry of the query type table, we have already reduced
the totality of the queries in the usage pattern into a relatively small set of query types.
Hence, the number of iterations required by the file cost estimator has already been
reduced. Although further clustering measures like the "nearest centroid" clustering scheme
of Belford [5] could be employed to still further reduce the number of query types, the degree of
query clustering we have employed has proven sufficient for our purposes. Tests show that
the file cost estimator (as programmed in the programming language MDL [26] on a PDP-10)
takes somewhat less than a second of processing time to estimate the cost of a set of
conjunctive and disjunctive query types. Furthermore, when queries are additionally
clustered, correlation information about attribute occurrences in queries will be inevitably lost,
and estimates based on clustered queries will be less reliable. Therefore further query
clustering is not advisable.


5. Repartitioning Cost

The cost of repartitioning the attribute partition is computed as follows: if at a
repartitioning point the new partition has subfiles $F_{k+1}, \ldots, F_M$ in common with the old
partition, then only subfiles $F_1, \ldots, F_k$ have to be retrieved, reorganized, and written back
on secondary storage. The total page accesses required to do this will be twice the number of
pages in each subfile:

$$2 \sum_{i=1}^{k} \left\lceil \frac{n}{b_i} \right\rceil$$

The above cost is based on repartitioning all the subfiles $F_1, \ldots, F_k$ simultaneously; i.e., the
pages of each subfile are read in sequence along with the pages of the other subfiles, the
attribute values are then transferred from one subfile to another, and finally the pages are
written back onto secondary storage. Each page of a subfile is thus accessed only twice, once
for reading and once for writing.
At each repartitioning point, the performance cost of the partition proposed by the
partitioning heuristics for the next time interval has already been computed. The
performance cost for the current file partition for the next time period is then computed. The
two are compared, and if the proposed partition offers a performance cost reduction greater
than the cost of repartitioning, then the file should be reorganized according to the proposed
partition.
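The repartitioning decision reduces to a single comparison. The sketch below assumes that the repartitioning cost is twice the page count of every subfile that has to be rewritten, as stated above; the function names and the sample figures are illustrative only.

```python
from math import ceil


def repartition_cost(n, blocking_factors):
    """Each affected subfile is read once and written once."""
    return 2 * sum(ceil(n / b) for b in blocking_factors)


def should_repartition(current_cost, proposed_cost, n, affected_blocking_factors):
    """Reorganize only if the saving over the next interval exceeds the one-time cost."""
    saving = current_cost - proposed_cost
    return saving > repartition_cost(n, affected_blocking_factors)


# Hypothetical file of 1000 tuples; two subfiles (b = 10 and b = 25) must be rewritten.
cost = repartition_cost(1000, [10, 25])
decision = should_repartition(5000, 4600, 1000, [10, 25])
```

Here the two affected subfiles occupy 100 and 40 pages, so reorganizing costs 280 page accesses; a forecast saving of 400 page accesses justifies it, whereas a saving of 200 would not.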


CHAPTER 5


THE ATTRIBUTE PARTITIONING HEURISTICS 


In this chapter we present a number of heuristics for partitioning the attributes of a
file. Each attribute partitioning heuristic starts with a supplied partition and derives from it
a superior partition. (If the heuristic is not able to improve on the supplied partition, the
heuristic will terminate and return the supplied partition as its result.) Therefore it is
possible to apply the attribute partitioning heuristics in succession, with each heuristic
starting with the resultant partition of the preceding one and producing a partition that is at
least as good as the preceding partition. We say that a heuristic is relevant to a partition if its
application will result in an improved partition.

We have performed a number of experiments on the attribute partitioning heuristics.
The results of the experimentation performed on each heuristic are included in the
discussion of that heuristic. Since our most extensive program of experimentation was
applied to our main heuristics, we have devoted Section 1 of the chapter entirely to a detailed
discussion of that subject. Before we proceed to describe the heuristics, we will first establish
the necessary terminology for the subsequent sections.

Let P be a partition of the set of attributes A of a file into disjoint subsets. Each
subset of A is termed a block of attributes; the ith block of the partition is denoted by $A_i$.
A block of attributes may be viewed as a representation of a subfile; i.e., when a file is
partitioned according to a given partition P, each block $A_i$ of P is directly implemented by
a subfile with attributes drawn from $A_i$. If M is the number of blocks in the partition, then
$P = \{A_i\}_{i=1}^{M}$, with $A_i \cap A_j = \emptyset$ for all $i \neq j$, and $\bigcup_{i=1}^{M} A_i = A$. The trivial partition $P^0$ has been
defined previously to be the partition where every subfile contains exactly one attribute.
That is, $P^0 = \{A_i^0\}_{i=1}^{m}$, where $A_i^0 = \{a_i\}$.


One way of finding the optimal partition is to produce all partitions of the set of
attributes, and evaluate each of them with the file cost estimator in order to identify the
partition with the best performance cost. This exhaustive enumeration approach is not a
viable partitioning strategy because of the large number of possible ways to partition a file.
The number of distinct partitions of a set of m elements into disjoint subsets, B(m), is
known as the mth Bell number. Unfortunately, there is no simple expression for B(m) that
we can analyze in order to arrive at its complexity. However, Moser and Wyman [27]
provide an asymptotic expansion for the Bell numbers. This asymptotic expansion is in
terms of the solution to the equation $x e^x = m$, and hence is not in closed form. From this
asymptotic expansion it is possible to derive the following asymptotic upper and lower
bounds for the Bell numbers [28]:

(5.1.1)   $B(m) = o(m^m)$

(5.1.2)   $m^{(1-\epsilon)m} = o(B(m)), \quad \epsilon > 0$

where $\epsilon$ is any non-zero positive real number. (The notation f(m) = o(g(m)) denotes that
$\lim_{m \to \infty} f(m)/g(m) = 0$.) The two asymptotic bounds are very tight, and from them we see that
the number of partitions of a set of m elements into disjoint subsets asymptotically
approaches close to $m^m$ (or equivalently, $e^{m \ln m}$) as m approaches infinity. By
this we mean that $\epsilon$ may be taken as small as possible as long as it is positive, and B(m) will
always grow faster than $m^{(1-\epsilon)m}$. Therefore for all practical purposes, the number of
distinct attribute partitions is prohibitively high to make an exhaustive enumeration
approach feasible (for the general attribute partitioning problem with any number of
attributes). As an example, a file containing 10 attributes can be partitioned into
B(10) = 115975 different partitions. Another problem with the exhaustive enumeration
approach is that generating all the B(m) different partitions is not an easy task: a program
written for generating all the partitions of a set of attributes (and which was used to
exhaustively find the optimal partition for a number of attribute partitioning problems with
not more than 8 attributes) required storage space that grew faster than B(m).
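The growth of B(m) is easy to reproduce. The sketch below computes Bell numbers with the standard Bell triangle and, separately, enumerates set partitions lazily with a generator so that only one partition is held in memory at a time; neither routine is taken from the report, and the names are illustrative.

```python
def bell_numbers(m_max):
    """Return [B(0), ..., B(m_max)] using the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(m_max):
        new_row = [row[-1]]            # each row starts with the last entry of the previous row
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells


def set_partitions(items):
    """Yield every partition of items (a list) as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


bells = bell_numbers(10)
count_for_four = sum(1 for _ in set_partitions([1, 2, 3, 4]))
```

The table confirms B(10) = 115975 and gives B(8) = 4140 for the largest problems that were solved exhaustively; the generator yields the 15 partitions of a four-attribute file.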


The heuristics we have considered in our work and described in subsequent sections
are all stepwise minimization heuristics. Stepwise minimization is the process of carrying out
an optimization task in a series of steps. At each step a cost criterion is optimized to the
extent possible. Each step that follows carries the optimization still further. Finally, when no
further optimization can be performed at a step, the stepwise minimization process is
terminated. In the case of the attribute partitioning heuristics, each heuristic starts from a
predetermined partition, and in each step tries to come up with a new partition that is an
improvement over the partition of the preceding step. By improvement we mean that the
performance cost of the improved partition, as evaluated by the file cost estimator, is less than
the performance cost of the previous partition. Once an improved partition is found at a
step, the next step starts with the newly found partition and tries to find a still better
partition. This process of incremental improvement is continued until no partition may be
found which is an improvement to the partition considered in the last step of the heuristic.
The last partition is then returned by the heuristic as the resultant partition of the heuristic.
The intermediate partition found at each step of the attribute partitioning heuristics will
depend on the partition of the last step, the query frequencies, and the query types.

At each step of the attribute partitioning heuristics we have considered, the
improved partition is obtained from the partition of the previous step by either 1- grouping
a number of blocks of the previous partition together into a single block, or by 2- degrouping
a block of the previous partition into two or more blocks. The heuristics we have considered
differ from one another in two respects: 1- the attribute partition that they initially start
with, and 2- the manner in which the blocks are grouped or degrouped in each step.

In our work, we apply a heuristic to an initial partition, and in the course of
stepwise minimization, the heuristic produces a partition that it can no longer
improve. At this point we may apply a second heuristic to the resultant partition of the first
heuristic. After the termination of the second heuristic, a third heuristic may be applied, or
even the first heuristic may be reapplied. Since a heuristic always results in a partition that
is at least as good as the partition that it starts with, it is always possible to apply any number of
heuristics in succession and never get a partition with a higher performance cost (and
occasionally get an improved partition). However, some of the heuristics we consider are best
succeeded by certain other heuristics. In the discussion of each heuristic, we will make it clear
if the heuristic performed well enough to warrant further investigation, and if so, what other
heuristics were used in combination with it.


Note that one mode of operation we do not consider is trying a heuristic for only one
or a few steps and switching to another heuristic before the first heuristic produces its final
resultant partition. Our mode of applying the attribute partitioning heuristics is based upon
the assumption that if a second heuristic is relevant to an intermediate partition produced by
a first heuristic (that is, the second heuristic can improve upon the performance cost of the
intermediate partition), then the second heuristic will still be relevant after the first heuristic
has terminated. This assumption is made in order to reduce to a manageable size the
problem of deciding which heuristic to apply next.


We shall consider a number of heuristics in the forthcoming sections. However, the
pairwise grouping heuristic described in the next section is the main heuristic of this work,
and we will attempt to describe it in full detail. In our experimentation, we have found
the combination of the pairwise grouping heuristic with a second heuristic (the single
attribute degrouping-regrouping heuristic) to be sufficient for the purpose of attribute
partitioning within the context of the database management system we have considered.


Pairwise Grouping


The pairwise grouping heuristic begins with the trivial partition $P^0$, and generates
all partitions that can be obtained by grouping together pairs of blocks in $P^0$. For example,
if A = {1, 2, 3, 4} are the attributes of a file, the pairwise grouping heuristic begins with the
trivial partition of row 0 of Figure 1 and produces all the partitions of row 1 of the same
figure. The heuristic then evaluates all the generated partitions with the file cost estimator,
and finds the partition (call it $P^1$) whose performance cost is the least of all the generated
partitions. In other words, assume C(P) to be the performance cost of partition P as
determined by the file cost estimator. In the first step of the heuristic, the following
minimization is performed:

(5.2.1)   $\min_{1 \le j < k \le m} C(P^0_{jk})$

where $P^0_{jk} = \{A^0_1, \ldots, A^0_j \cup A^0_k, \ldots, A^0_m\}$. Let j and k be the values that minimize 5.2.1. If it is
the case that $C(P^0_{jk}) < C(P^0)$, then the improved partition $P^0_{jk}$ is the result of the first step,
and the second step of the heuristic begins with partition $P^1 = P^0_{jk}$. Otherwise, if it is the
case that $C(P^0_{jk}) \ge C(P^0)$, the heuristic terminates (with the trivial partition as the resultant
partition). In general, the ith step of the pairwise grouping heuristic starts with partition
$P^{i-1} = \{A^{i-1}_1, \ldots, A^{i-1}_{M_{i-1}}\}$ (where $M_{i-1}$ is the number of blocks in $P^{i-1}$) and performs the
minimization

(5.2.2)   $\min_{1 \le j < k \le M_{i-1}} C(P^{i-1}_{jk})$

where $P^{i-1}_{jk} = \{A^{i-1}_1, \ldots, A^{i-1}_j \cup A^{i-1}_k, \ldots, A^{i-1}_{M_{i-1}}\}$. Assuming j and k minimize 5.2.2, and if
$C(P^{i-1}_{jk}) < C(P^{i-1})$, the heuristic then goes to step i+1 starting with $P^i = P^{i-1}_{jk}$, $M_i = M_{i-1} - 1$. This
process is continued until a step (say step u) is reached for which $C(P^{u-1}_{jk}) \ge C(P^{u-1})$ for all j
and k. At this stage no pair of blocks can be found whose grouping will reduce the
performance cost, and so $P^{u-1}$ is returned as the result of the pairwise grouping heuristic.
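The steps of the heuristic translate into a short greedy loop. In the sketch below the cost function is passed in as a parameter, standing in for the file cost estimator; the toy cost used in the example (each query pays the width of every block it touches plus one unit of per-block overhead) is purely hypothetical and is not the report's estimator.

```python
from itertools import combinations


def pairwise_grouping(attributes, cost):
    """Greedy stepwise minimization starting from the trivial partition."""
    partition = [frozenset([a]) for a in attributes]
    best_cost = cost(partition)
    while len(partition) > 1:
        candidates = []
        for j, k in combinations(range(len(partition)), 2):
            # Group blocks j and k, leave every other block untouched.
            merged = [blk for i, blk in enumerate(partition) if i not in (j, k)]
            merged.append(partition[j] | partition[k])
            candidates.append((cost(merged), merged))
        cand_cost, cand = min(candidates, key=lambda c: c[0])
        if cand_cost >= best_cost:      # no pair reduces the cost: stop
            break
        partition, best_cost = cand, cand_cost
    return partition, best_cost


# Hypothetical usage pattern: (frequency, set of attributes requested).
QUERIES = [(10, {1, 2}), (10, {3, 4})]


def toy_cost(partition):
    """Each query pays (block width + 1) for every block holding a requested attribute."""
    return sum(freq * sum(len(blk) + 1 for blk in partition if blk & attrs)
               for freq, attrs in QUERIES)


result, result_cost = pairwise_grouping([1, 2, 3, 4], toy_cost)
```

With this toy cost the trivial partition costs 80, the first step groups {1, 2} (cost 70), the second groups {3, 4} (cost 60), and the third step stops because the one-file partition would cost 100.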

The pairwise grouping heuristic may be explained in terms of a lattice where each
node of the lattice is a partition. The top node is the trivial partition and the bottom node is
the "one-file" partition. An interior node is obtained by grouping together a pair of blocks of
one of its parents. Figure 1 shows such a lattice for the set of four attributes {1, 2, 3, 4}
(from here on we shall use integers to represent attributes). The ith row of the lattice
corresponds to all the partitions that could be generated by the ith step of the pairwise
grouping heuristic (equivalently, all the possible partitions with which the (i+1)th step may
begin). The pairwise grouping heuristic begins with the trivial partition and produces all
the partitions that can be reached by following an edge (i.e., all the partitions of the first
row). It then selects the partition in that row with the best performance cost. From that
partition, it follows all the edges leading downwards to its children nodes. For example, if the
second partition from the left is the best partition of row 1, then in the next step the heuristic
would compare the first, fourth, and sixth partitions of row 2. The three partitions of the
second row are all, in a sense, desirable partitions in that they have been derived from a
partition that has previously been proved superior. Specifically, the heuristic assumes that
the optimal partition is either among the three partitions or is somewhere below them in the
lattice and can be reached by going down the edge from the best of the three partitions. The
heuristic continues to go down the lattice until none of the partitions examined in a row
reduces the performance cost. At this point, the current parent partition is returned as the
resultant partition of the pairwise grouping heuristic.

Row 0:  {{1}, {2}, {3}, {4}}

Row 1:  {{1,2}, {3}, {4}}   {{1,3}, {2}, {4}}   {{1,4}, {2}, {3}}   {{1}, {2,3}, {4}}   {{1}, {2,4}, {3}}   {{1}, {2}, {3,4}}

Row 2:  {{1,2,3}, {4}}   {{1,2,4}, {3}}   {{1,2}, {3,4}}   {{1,3}, {2,4}}   {{1,4}, {2,3}}   {{1,3,4}, {2}}   {{1}, {2,3,4}}

Row 3:  {{1, 2, 3, 4}}

Figure 1  The lattice of partitions.
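The rows of the lattice and the children of a node can be generated mechanically. The following sketch represents a partition as a frozenset of frozenset blocks so that identical partitions reached from different parents are merged; the function names are illustrative.

```python
from itertools import combinations


def children(partition):
    """All partitions obtained by grouping one pair of blocks of `partition`."""
    blocks = list(partition)
    result = []
    for j, k in combinations(range(len(blocks)), 2):
        rest = [blk for i, blk in enumerate(blocks) if i not in (j, k)]
        result.append(frozenset(rest + [blocks[j] | blocks[k]]))
    return result


def lattice_rows(attributes):
    """Row i holds every partition reachable from the trivial one in i groupings."""
    row = {frozenset(frozenset([a]) for a in attributes)}
    rows = [row]
    while len(next(iter(row))) > 1:     # stop once the one-file partition is reached
        row = {child for p in row for child in children(p)}
        rows.append(row)
    return rows


def P(*blocks):
    """Helper for writing partitions compactly, e.g. P({1, 3}, {2}, {4})."""
    return frozenset(frozenset(b) for b in blocks)


row_sizes = [len(r) for r in lattice_rows([1, 2, 3, 4])]
kids = set(children(P({1, 3}, {2}, {4})))
```

For four attributes the rows contain 1, 6, 7 and 1 partitions, and the children of {{1,3}, {2}, {4}} are {{1,2,3}, {4}}, {{1,3}, {2,4}} and {{1,3,4}, {2}}, which are the three row-2 partitions named in the example above.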
The resultant partition of the pairwise grouping heuristic is not necessarily the
optimal partition. Only a small subset of all partitions are actually examined by this process.
On the other hand, at each step, the heuristic does select the best partition among a set of
partitions whose common parent was itself selected as best of a similar set of partitions.
Hence, the resultant partition is optimal among a subset of all the possible partitions and
locally optimal among all the partitions of the lattice. The stepwise minimization nature of
the pairwise grouping heuristic is apparent from the discussion above. At each step, we
minimize the cost for a subset of partitions that have been selected on the basis of a similar
minimization in the previous step.
The motivation behind pairwise grouping is as follows: Initially, when all attributes
are separated in the trivial partition, those queries that request two attributes are answered
with close to minimum cost, while those queries requesting more than two attributes are very
costly to answer because their attributes reside in different subfiles and hence on different
pages. Subsequently, as blocks of attributes are grouped, queries requesting a small number
of attributes become costlier to answer because accessing the attributes will bring in those
attributes that are not requested by the query but nevertheless reside in the same subfile as a
requested attribute, while those queries requesting many attributes, of which all or some are in
the same subfile, become less costly to answer. In the process of grouping blocks together, a
point will be reached where the reduction in cost of answering those queries that are
benefited by the grouping will not offset the increase in cost of answering those queries that
become costlier due to the grouping. This point is a local minimum of the performance cost
function.

The Bond Energy Algorithm of McCormick et al. [24] is another stepwise
minimization heuristic that may be used for the purpose of attribute partitioning. Hoffer and
Severance [19] have used the Bond Energy Algorithm to group attributes into blocks based
on the similarities of attribute occurrences in queries (see Chapter 2 for a detailed discussion of
how Hoffer and Severance [19] utilize the Bond Energy Algorithm for the purpose of
attribute partitioning). We believe that our pairwise grouping heuristic, when compared to
the Bond Energy Algorithm, has a number of advantages which make it more desirable as a
partitioning vehicle. The Bond Energy Algorithm operates by permuting the columns of a
matrix consisting of pairwise attribute access similarity measures in such a way that the
columns of similar attributes fall close together. If we look at the matrix of pairwise access
similarity measures after the algorithm has terminated, we will find that the attributes are
ordered such that similar attributes are placed adjacent or nearly adjacent to one another. A
disadvantage of this algorithm is that after such an ordering of
the attributes is found, it is left to subjective judgement to decide how to clump the attributes
together to form blocks. The other disadvantage of this algorithm (and one which will be
examined in Section 4) is that the algorithm only looks at the similarity of access between
pairs of attributes (i.e., between pairs of blocks, each of one attribute) rather than among any
number of blocks containing any number of attributes.

Stepwise minimization is basically a hill climbing heuristic search technique which will not necessarily locate the optimal solution. The solution achieved using this technique may be any of the local minima; the closeness of the solution to the optimal partition will depend on the database parameters, the access paths of the file, and the usage pattern parameters. However, in the course of our experimentation with the pairwise grouping heuristic (to be described in full detail in Section 5.7), the pairwise grouping heuristic starting with the trivial partition has consistently resulted in either the optimal partition or in a near-optimal partition that differed insignificantly from the optimal partition. This has led us to believe that pairwise grouping is an attractive heuristic search technique for finding an adequate partition for the attribute partitioning problem.

The process of pairwise grouping is actually the method of steepest descent of the hill climbing heuristic search technique. The coordinates of a point on the "hill" (which should be visualized as inversed, since the search is for finding the lowest point) are the partition and the performance cost of the partition as determined by the file cost estimator. The distance between two partitions is defined as the number of edges on the minimum path connecting the two partitions in the lattice of partitions. Pairwise grouping is the process of following the negative gradient from one point to an adjacent point with a distance of one (along the partition axis), beginning at the point of the trivial partition. Our conclusion from this course of experimentation has been that this "hill" is predominantly devoid of "bumps" (i.e. local minima, or points where the gradient changes sign and all adjacent points to the "bump" have a larger performance cost). The few "bumps" that occur on the hill happen to be very close to the top of the "hill" in such a way that these "bumps" are not significantly lower than the top of the "hill" itself (the global minimum). In other words, it appears that the "hill" which is the performance cost of the partitioned database is rather "smooth", and the "bumps" that indicate partitions that are locally minimum all occur near the highest point of the "hill".
As we will fully report in Section 5.7, the performance of the pairwise grouping heuristic is extremely good, managing to improve the performance of the database management system by a factor of 5 with respect to the one-file partition (i.e. the unpartitioned file). The resultant partition of the pairwise grouping heuristic was in all cases either optimal or near optimal. To determine the extent to which the partition produced by the pairwise grouping heuristic is optimal (by comparing it with the optimal partition), we exhaustively produced the optimal partition for cases where m was less than or equal to 8 (B(8) = 4140). For cases where m was greater than 8, we either manually generated promising partitions, or applied other heuristics to the resultant partition of the pairwise grouping heuristic. In all cases, the resultant partition of the pairwise grouping heuristic was either the optimal partition, or was within a few percent of it. In the few cases where the resultant partition did not coincide with the optimal partition, it was observed that some blocks in the resultant partition contained one or more spurious attributes that, if removed from their blocks, improved the performance. These spurious attributes, it turned out, were inserted in their respective blocks rather early in the minimization process. That is, although a spurious attribute resulted in performance improvement early in the grouping process (in other words a spurious attribute was attractive to its block), later on, as other attributes were added to the same block, the spurious attribute was no longer attractive to the augmented block, and degrouping it from its block or transferring it to other blocks became beneficial. In Section 6 we will describe other stepwise minimization heuristics, some intended to eliminate this deficiency of the pairwise grouping heuristic.
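The exhaustive baseline used for m up to 8 can be illustrated with a short sketch. It enumerates every partition of a set of attributes (there are B(m) of them, the Bell number) and keeps the cheapest one. The function names and the `toy_cost` function are hypothetical stand-ins introduced here for illustration; in particular `toy_cost` is a made-up cost model, not the file cost estimator of this report.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Either `first` forms a block of its own ...
        yield [[first]] + partial
        # ... or it joins one of the existing blocks.
        for i, block in enumerate(partial):
            yield partial[:i] + [[first] + block] + partial[i + 1:]


def toy_cost(partition, queries, overhead=2):
    """Hypothetical cost: each block touched by a query costs a fixed
    overhead plus the number of attributes stored in that block."""
    return sum(freq * (overhead + len(block))
               for attrs, freq in queries
               for block in partition if set(block) & attrs)


def exhaustive_optimum(attributes, cost):
    """Return the cheapest partition by trying all B(m) of them."""
    return min(set_partitions(list(attributes)), key=cost)


# Assumed usage pattern: (requested attributes, frequency) pairs.
QUERIES = [({"a", "b"}, 10), ({"c"}, 5), ({"a", "b", "c"}, 1)]
best = exhaustive_optimum("abcd", lambda p: toy_cost(p, QUERIES))
```

Because B(m) grows very quickly, this brute-force search is only practical for small m, which is consistent with the limit of 8 attributes used above.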
The major disadvantage of the pairwise grouping heuristic is its slowness. Every partition that the heuristic produces by grouping a pair of blocks needs to be cost evaluated by the file cost estimator. Let us see, in the course of carrying out the minimization 5.3.2, how many partitions need to be cost estimated. In the analysis below, we have assumed that if the file has m attributes, then on the average the heuristic will iterate for $\lceil m/2 \rceil$ steps. This is a reasonable assumption in that the heuristic may iterate anywhere from 1 to m-1 steps. (Our analysis will still hold as long as the number of steps is a constant fraction of m.) The improved partition produced at a step has exactly one less block than the partition of the previous step, i.e. $M_i = M_{i-1} - 1$, and the ith step will start with a partition that has $M_{i-1} = m - i + 1$ blocks. Under this assumption, the final partition produced by the heuristic has $\lfloor m/2 \rfloor + 1$ blocks: there are $\lceil m/2 \rceil - 1$ steps that produce an improvement and a final step that produces no improvement. Therefore at the ith step, there will be $\binom{m-i+1}{2}$ partitions generated that need to be cost estimated. On the average, the total number of file cost estimations will be:

$$\sum_{i=1}^{\lceil m/2 \rceil} \binom{m-i+1}{2}$$

The above expression is of order $m^3$. This order is quite high, and it is desirable to reduce the partition search cost by doing either of the following two things: 1- reduce the cost of file cost estimation by evaluating partitions in some other manner, or 2- reduce the number of file cost estimations by cost estimating only some of the partitions generated at each step of the heuristic. The next section will investigate the possibility of cost reduction by employing means other than file cost estimation. Section 5 will investigate the possibility of reducing the number of partitions that need to be cost estimated.
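The count above is easy to check numerically. The sketch below evaluates the sum of binomial(m-i+1, 2) over the assumed ceiling(m/2) steps; the function name is ours, chosen only for illustration. For large m the sum approaches 7m^3/48, which is one way to see that it is of order m^3.

```python
from math import ceil, comb


def pairwise_cost_estimations(m):
    """Number of file cost estimations made by pairwise grouping,
    assuming the heuristic iterates for ceil(m/2) steps."""
    steps = ceil(m / 2)
    # Step i starts with m - i + 1 blocks and tries every pair of them.
    return sum(comb(m - i + 1, 2) for i in range(1, steps + 1))


for m in (4, 8, 16, 32):
    print(m, pairwise_cost_estimations(m))
```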


Assume we have the partition $P = \{A_1, \ldots, A_M\}$. Define the (n-wise) block attractivity measure $\alpha$ of $\{A_{j_1}, \ldots, A_{j_n}\} \subseteq P$ to be the cost reduction (or increase) obtained by grouping the n blocks together:

$$\alpha(\{A_{j_1}, \ldots, A_{j_n}\}; P) = C(P) - C\big((P \setminus \{A_{j_1}, \ldots, A_{j_n}\}) \cup \{A_{j_1} \cup \cdots \cup A_{j_n}\}\big)$$

If $\alpha(\{A_{j_1}, \ldots, A_{j_n}\}; P)$ is positive, then partition P may be improved by grouping $A_{j_1}, \ldots, A_{j_n}$. If $\alpha(\{A_{j_1}, \ldots, A_{j_n}\}; P)$ is negative, then grouping $A_{j_1}, \ldots, A_{j_n}$ will increase the performance cost. The pairwise grouping heuristic cost estimated all partitions $P_{jk}$ generated by grouping together blocks $A_j$ and $A_k$ of P in the ith step, and selected the partition with minimum cost. Since $C(P)$ was available from the previous step, cost estimating all $P_{jk}$ where $1 \le j < k \le M_{i-1}$ is equivalent to finding all the block attractivities $\alpha(\{A_j, A_k\}; P)$, and minimizing $C(P_{jk})$ is equivalent to finding the maximum value for $\alpha(\{A_j, A_k\}; P)$, where $1 \le j < k \le M_{i-1}$. The two blocks $A_j$ and $A_k$ that maximize $\alpha$ are then grouped in going from the current step to the next step. (They are grouped only if $\alpha(\{A_j, A_k\}; P) > 0$.) From this discussion, we see that a step of pairwise grouping may be viewed as the process of finding the pair of blocks in a partition with maximum attractivity.
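The definition of the attractivity measure and the step it induces can be sketched as follows. This is a minimal illustration, not the implementation used in this work: `toy_cost` is a hypothetical stand-in for the file cost estimator, and the names `attractivity`, `pairwise_grouping` and `QUERIES` are ours.

```python
from itertools import combinations


def toy_cost(partition, queries, overhead=2):
    """Hypothetical cost: each block touched by a query costs a fixed
    overhead plus the number of attributes stored in that block."""
    return sum(freq * (overhead + len(block))
               for attrs, freq in queries
               for block in partition if block & attrs)


def attractivity(blocks, partition, cost):
    """Cost reduction obtained by grouping `blocks` within `partition`."""
    merged = frozenset().union(*blocks)
    grouped = [b for b in partition if b not in blocks] + [merged]
    return cost(partition) - cost(grouped)


def pairwise_grouping(attributes, cost):
    """Start from the trivial partition and repeatedly group the most
    attractive pair of blocks while its attractivity is positive."""
    partition = [frozenset([a]) for a in attributes]
    while len(partition) > 1:
        best = max(combinations(partition, 2),
                   key=lambda pair: attractivity(pair, partition, cost))
        if attractivity(best, partition, cost) <= 0:
            break  # local minimum of the cost function reached
        partition = ([b for b in partition if b not in best]
                     + [best[0] | best[1]])
    return partition


# Assumed usage pattern: (requested attributes, frequency) pairs.
QUERIES = [({"a", "b"}, 10), ({"c"}, 5), ({"a", "b", "c"}, 1)]
result = pairwise_grouping("abcd", lambda p: toy_cost(p, QUERIES))
```

Note that every call to `attractivity` costs one evaluation of the grouped partition, which is exactly the expense the following sections try to reduce.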


In the previous section we noted that one way the pairwise grouping heuristic may be made more efficient is by refraining from file cost estimation, that is, by computing the block attractivity measure for all pairs of blocks without resort to file cost estimation.

In order to avoid computing $\alpha$, we may try to find an approximation to it that is easily computable. An approximate block attractivity measure must possess certain qualities of the (ideal) block attractivity measure $\alpha$ in order to guarantee that the outcome of the pairwise grouping heuristic using the approximate measure is the same as the outcome using the (ideal) block attractivity measure (the reason we call $\alpha$ the ideal measure is that no other measure can be better than file cost estimation). Below we give a number of properties that a candidate measure for the approximation to the block attractivity measure must necessarily possess:
1- The candidate measure should depend not only on the set of blocks in the partition whose attractivity is being computed, but also on how the rest of the attributes are blocked in the partition. That is, the same set of blocks may have a different attractivity depending on the partition they are in. Because of this, we cannot compute the attractivity measure of a pair of blocks by considering only the pair of blocks and their attributes. Rather, attractivities must be computed in the context of the partition as a whole. Consequently, the attractivity measure of a pair of blocks in a step of the pairwise grouping heuristic depends also on the partition reached at that step. (If this were not the case and the candidate measure depended only on the pair of blocks, then it would become possible to redistribute the attributes in the rest of the blocks such that the ideal measure changes and becomes of a different sign, signifying that the pair of blocks have become attractive or unattractive, while the candidate measure remains unchanged.) It is for this reason that the attractivities computed in a step of the pairwise grouping heuristic cannot be used in future steps.

2- The candidate measure must be computed by considering all queries made to the file.

3- The candidate measure should be applicable to blocks with any number of attributes in them, rather than just being applicable to blocks containing a single attribute.

4- It must be possible to compute the attractivity of any number of blocks; i.e., the candidate measure must be an n-wise attractivity measure. We need this capability because some of the other heuristics we shall consider require the mutual attractivity of more than two blocks. One might assume that the attractivity of a set of more than two blocks can be obtained by applying the pairwise measure to all pairs of blocks in the set, and making the n-wise measure the sum total of the pairwise measures. This is not true because the pairwise attractivity measure is not transitive. For example, if blocks $A_1$ and $A_2$ are attractive, and likewise blocks $A_2$ and $A_3$, we cannot conclude that it is desirable to block $A_1$, $A_2$, and $A_3$ into the same block, because blocks $A_1$ and $A_3$ may be unattractive to each other. Even if it were the case that all pairs of $A_1$, $A_2$, and $A_3$ were attractive, there is no reason to conclude that all three blocks are mutually attractive. Hence, an approximate measure must also be an n-wise block attractivity measure. (Note that the ideal block attractivity measure is an n-wise measure.)

We have considered the attribute similarity measure of Hoffer and Severance [19] as a candidate for the approximate block attractivity measure. The pairwise attribute similarity measure of two attributes (or equivalently two blocks, each containing one attribute) is the ratio of the information transferred (from secondary storage to primary memory, for the purpose of selecting tuples that satisfy a predicate or for the purpose of projecting a value), to the total information transferred from a subfile consisting solely of the two attributes (when answering queries that request one or both of the attributes). In other words, assume that the two attributes exist by themselves in one subfile. When a query that requests one or both of the attributes is made to the database, a portion (for some queries, all) of the subfile's subtuples will be transferred to primary memory (depending on the joint selectivity of the other attributes of the query that have already been resolved), and only a certain percentage of this portion will actually be useful data (because the first of the two attributes may further restrict the set of tuples to the tuples that satisfy a predicate on the attribute). This percentage is the ratio discussed above. There is one such ratio for each query type. The weighted average of all such ratios will be the pairwise attribute similarity measure of the two attributes (the weight of a query type is its frequency). From this definition we see that the pairwise attribute similarity measure applies only to pairs of blocks, where each block consists of only one attribute. (Hoffer [4] describes an extension of the attribute similarity measure that works for more than two blocks, but where each block still consists of one attribute only.) The other problem with the attribute access similarity measure is that it does not depend on the other blocks in the partition; it just depends on the two attributes of the measure. For the above reasons, we have concluded that the attribute access similarity measure of Hoffer and Severance cannot be used as an approximate measure of the block attractivities.

Considering the above prerequisites for an approximate measure, and the fact that the file cost estimator implemented turned out to be rather fast in estimating the performance cost of a partition (the speed of the file cost estimator is partly due to the rather simple query evaluation strategy that it uses), we decided to retain the ideal measure (i.e. the difference between the performance costs of the old and new partitions) for the purpose of finding the most attractive pair of blocks in the pairwise grouping heuristic. Using the ideal block attractivity measure, we were able to bring out the real characteristics of the attribute partitioning problem, characteristics that might be blurred and obscured using a less than ideal measure. Indeed, the choice of the file cost estimator as the attractivity measure has proven useful to us and has allowed us to considerably reduce the number of pairwise attractivity measures that need to be computed. The next section elaborates on how this reduction is achieved.


In the previous section it was pointed out that the attractivity of a set of blocks in a partition is also dependent on the other blocks of the partition. However, in the course of our experimentation, it was observed that if blocks $A_x$ and $A_y$ were attractive to one another at a step of the pairwise grouping heuristic, then $A_x$ and $A_y$ remained attractive with almost the same attractivity measure at the next step, with one possible exception. The exception is when either of blocks $A_x$ or $A_y$ were grouped (together or with another block) in going from one step to the next step. In terms of the attractivity measure, we observed that:

(5.5.1) $\alpha(\{A_x, A_y\}; P) \approx \alpha(\{A_x, A_y\}; P_{jk})$

or equivalently that:

$C(P) - C(P_{xy}) \approx C(P_{jk}) - C(P_{jk,xy})$, for all $x \ne j, k$ and $y \ne j, k$

where $P = \{A_1, \ldots, A_M\}$, $P_{xy} = \{A_1, \ldots, A_x \cup A_y, \ldots, A_M\}$, $P_{jk} = \{A_1, \ldots, A_j \cup A_k, \ldots, A_M\}$ and $P_{jk,xy} = \{A_1, \ldots, A_j \cup A_k, \ldots, A_x \cup A_y, \ldots, A_M\}$. It was also observed that the attractivity measures of pairs of blocks retained their relative order (in terms of magnitude) with respect to one

request an attribute from $A_j \cup A_k$, then $C(P_{jk}) = C(P)$ and $C(P_{jk,xy}) = C(P_{xy})$ and the query is non-volatile. If the query does not request an attribute from $A_x \cup A_y$, then $C(P_{xy}) = C(P)$ and $C(P_{jk,xy}) = C(P_{jk})$ and the query is non-volatile. Another necessary condition for a query to be volatile is that grouping $A_j$ and $A_k$ should make a difference in the processing of the volatile query. That is, a volatile query must be processed with a different method in partition $P_{jk}$ (or $P_{jk,xy}$) compared to the method it is processed with in partition $P$ (or $P_{xy}$). If this were not the case and the query's method was the same for both $P$ and $P_{jk}$, then the effect that grouping $A_x$ and $A_y$ would have on the cost of processing the query will be the same whether or not $A_j$ and $A_k$ are grouped, and hence the two effects will cancel one another out in 5.5.1. This will cause the query to become non-volatile. Therefore for a query to be volatile, it must not only request attributes from both $A_j \cup A_k$ and $A_x \cup A_y$ in its selection component, but grouping $A_j$ and $A_k$ should also influence the method in which the query is processed. From this discussion we may conclude that the queries in a database usage pattern are predominantly non-volatile.


To reiterate, the observation 5.5.1 holds because the queries in the usage pattern are predominantly non-volatile, and if a query happens to be volatile, it does not alone determine the attractivity of $A_x$ and $A_y$.

Observation 5.5.1 has a number of important implications. First, it says that the pairwise attractivity measure needs to be computed for all pairs of blocks only once, and that the attractivity computed remains valid in subsequent steps as long as neither block of the pair participates in a grouping. Second, it says that the attractivities which need to be

the sorted list that contain either one or both of the blocks participating in the grouping. It then computes the pairwise attractivity measures between the newly created block and the rest of the blocks in the partition. The newly computed attractivity measures are then sorted and merged with the sorted list. The most attractive pair of blocks is then selected from the top of the list and grouped for the current step. The above process is then repeated on the new partition.
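The bookkeeping described above can be sketched as follows, under stated assumptions. A dictionary of cached attractivities is used here in place of the sorted list (a simplification of ours), `toy_cost` is a hypothetical stand-in for the file cost estimator, and the evaluation counter exists only to make the saving visible.

```python
from itertools import combinations


def toy_cost(partition, queries, overhead=2):
    """Hypothetical cost: each block touched by a query costs a fixed
    overhead plus the number of attributes stored in that block."""
    return sum(freq * (overhead + len(block))
               for attrs, freq in queries
               for block in partition if block & attrs)


def fast_pairwise_grouping(attributes, cost):
    """Pairwise grouping that reuses attractivities between steps."""
    partition = [frozenset([a]) for a in attributes]
    evaluations = 0

    def alpha(x, y):
        nonlocal evaluations
        evaluations += 1
        grouped = [b for b in partition if b not in (x, y)] + [x | y]
        return cost(partition) - cost(grouped)

    # All pairwise attractivities are computed once, on the trivial partition.
    cache = {(x, y): alpha(x, y) for x, y in combinations(partition, 2)}
    while cache:
        (x, y), best = max(cache.items(), key=lambda item: item[1])
        if best <= 0:
            break
        merged = x | y
        partition = [b for b in partition if b not in (x, y)] + [merged]
        # Discard entries that mention either grouped block ...
        cache = {pair: a for pair, a in cache.items()
                 if x not in pair and y not in pair}
        # ... and evaluate only the pairs involving the new block.
        for other in partition:
            if other != merged:
                cache[(merged, other)] = alpha(merged, other)
    return partition, evaluations


# Assumed usage pattern: (requested attributes, frequency) pairs.
QUERIES = [({"a", "b"}, 10), ({"c"}, 5), ({"a", "b", "c"}, 1)]
result, evaluations = fast_pairwise_grouping(
    "abcd", lambda p: toy_cost(p, QUERIES))
```

On this small example the cached variant evaluates 8 pairs where the plain heuristic would evaluate 9; the gap widens as the number of attributes and steps grows, since each later step evaluates only the pairs that involve the newly created block.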

Our experimentation with the fast pairwise grouping heuristic has been very successful. The fast pairwise grouping heuristic consistently produced the same final partition as the pairwise grouping heuristic in all the tests performed. Furthermore, it was drastically more efficient than the pairwise grouping heuristic. The extent to which the fast pairwise grouping heuristic is more efficient than the pairwise grouping heuristic increases as m increases and as the number of steps the heuristic iterates increases (the two heuristics iterate for the same number of steps since they produce the same final partition). In our experiments, the fast pairwise grouping heuristic was anywhere between 1.6 and 5 times more efficient than the pairwise grouping heuristic. Because of its advantages we have selected the fast pairwise grouping heuristic, instead of the pairwise grouping heuristic, as one of the two main heuristics of our attribute partitioning system.


6. Other Variants of the Stepwise Minimization Heuristic


In addition to the pairwise grouping heuristic described above, a number of other heuristics were developed and subsequently programmed and tested. We will describe the following heuristics:

1- the k-partition pairwise grouping heuristic,
2- the two most attractive pairwise grouping heuristic,
3- the triplewise grouping heuristic,
4- the binary degrouping heuristic,
5- the single attribute degrouping heuristic,
6- the double attribute degrouping heuristic,
7- the single attribute degrouping-regrouping heuristic,
8- the double attribute degrouping-regrouping heuristic,
9- the single attribute ungrouping heuristic.

We will also give the overall results of experiments conducted in different environments, so that the cost effectiveness of each heuristic (i.e. the optimality of the resultant partition when measured against the effort required to produce it) may be assessed. Heuristics 1 - 4 are intended as alternatives to the (fast) pairwise grouping heuristic, while heuristics 5 - 8 (which degroup blocks in each step) are intended to improve upon the resultant partition of the (fast) pairwise grouping heuristic.

1- The k-partition pairwise grouping heuristic - This heuristic is a simple extension of the pairwise grouping heuristic. In the first step of this heuristic, the k partitions with the best performance costs are selected and passed to the next step. Each subsequent step receives k partitions and generates all possible partitions that can be obtained from them by pairwise grouping of blocks. Out of the set of partitions generated, the best k are selected and the process above is repeated. The pairwise grouping heuristic is a special case of the k-partition pairwise grouping heuristic where k = 1. Therefore, the k-partition pairwise grouping heuristic will always result in a partition that is at least as optimal as the resultant partition of the pairwise grouping heuristic. The drawback of this heuristic is its search cost, which is roughly k times that of the pairwise grouping heuristic. In view of the near optimality (and in most cases the optimality) of the pairwise grouping heuristic, it is not possible to justify the additional search cost that this heuristic incurs.
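A minimal sketch of the k-partition idea follows. The text does not spell out the termination rule, so the one used here (stop when the best partition of a step no longer improves on the best seen so far) is an assumption, as are the function names and the hypothetical `toy_cost` stand-in for the file cost estimator.

```python
from itertools import combinations


def toy_cost(partition, queries, overhead=2):
    """Hypothetical cost: each block touched by a query costs a fixed
    overhead plus the number of attributes stored in that block."""
    return sum(freq * (overhead + len(block))
               for attrs, freq in queries
               for block in partition if block & attrs)


def children(partition):
    """All partitions obtained by grouping one pair of blocks."""
    for x, y in combinations(partition, 2):
        yield (partition - {x, y}) | {x | y}


def k_partition_grouping(attributes, cost, k):
    """Keep the k cheapest partitions at every step instead of one."""
    start = frozenset(frozenset([a]) for a in attributes)
    kept, best = [start], start
    while True:
        candidates = {child for p in kept for child in children(p)}
        if not candidates:
            break  # only the one-file partition is left
        kept = sorted(candidates, key=cost)[:k]
        if cost(kept[0]) >= cost(best):
            break  # assumed stopping rule: no further improvement
        best = kept[0]
    return best


# Assumed usage pattern: (requested attributes, frequency) pairs.
QUERIES = [({"a", "b"}, 10), ({"c"}, 5), ({"a", "b", "c"}, 1)]
cost = lambda p: toy_cost(p, QUERIES)
greedy = k_partition_grouping("abcd", cost, k=1)
wider = k_partition_grouping("abcd", cost, k=2)
```

With k = 1 the sketch reduces to plain pairwise grouping; each step with larger k evaluates the children of k partitions, which reflects the roughly k-fold search cost noted above.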

2- The two most attractive pairwise grouping heuristic - We noted in Section 4 that block attractivities remain relatively unchanged from one step of the pairwise grouping heuristic to another if none of the blocks participate in a grouping. We also noted that the attractivity of pairs of blocks retained their relative order from one step to another. Hence, in each step of the pairwise grouping heuristic, it is likely that the few most attractive pairs of blocks will be eventually grouped in subsequent steps, and therefore we can deduce that the two most attractive pairs of blocks in a step are good candidates for being grouped simultaneously in one step.

Having this in mind, the two most attractive pairwise grouping heuristic was developed. This heuristic takes the top two most attractive pairs of blocks and groups both pairs in one step. (If the two most attractive pairs of blocks have a block in common, then all three blocks will be grouped in one step.) Each subsequent step takes the partition of the last step and simultaneously groups the two most attractive pairs of blocks. Grouping the two most attractive pairs of blocks has the advantage that the number of steps required to get the final partition is reduced by a factor of 2. However, this heuristic has the disadvantage that as the number of attractive pairs diminishes towards the end, grouping the top two pairs becomes too arbitrary a thing to do. If the first best pair of blocks and the second best pair of blocks have any block in common, then it is possible that the uncommon block in the second pair is not attractive to the new block formed by grouping the first pair (as noted previously, the pairwise attractivity measure is not transitive). Another disadvantage of this heuristic is that, even if the two top pairs of blocks have no common blocks, it is possible that after grouping the first best pair, another block in the partition manifests a greater attractivity to the newly created block than the attractivity of the second best pair of blocks. Indeed, tests show that the two most attractive pairwise grouping heuristic produces the same partitions in the first few steps as the pairwise grouping heuristic, but later on it diverges and produces a less optimal partition.
3- The triplewise grouping heuristic - This heuristic is the third variation of the pairwise grouping heuristic we have considered. This heuristic groups triples of blocks at each step instead of pairs of blocks. In the pairwise grouping heuristic, we generate all the partitions by grouping pairs of blocks. The triplewise grouping heuristic generates all partitions by grouping triples of blocks and then selects the partition with best performance. A number of partitioning problems were solved using this heuristic; in all cases, the triplewise grouping heuristic produced a nonoptimal partition, and in almost every case, the resultant partition was less optimal than the pairwise grouping heuristic partition.

The inferior performance of this heuristic may be attributed to the large number of blocks it tries to group in one step. Grouping three blocks simultaneously in one step is somewhat too coarse an action. For example, in some experiments we observed a set of three blocks $A_1$, $A_2$, and $A_3$ that were all pairwise attractive to one another, but if $A_1$ and $A_2$ were grouped together, $A_3$ was no longer attractive to the block $A_1 \cup A_2$ (i.e. $\alpha(\{A_1, A_2\}; P) > 0$ and $\alpha(\{A_1, A_3\}; P) > 0$ and $\alpha(\{A_2, A_3\}; P) > 0$, but $\alpha(\{A_1 \cup A_2, A_3\}; P_{12}) \le 0$). The triplewise grouping heuristic also suffers from the drawbacks of the two most attractive pairwise grouping heuristic mentioned above. Finally, the triplewise grouping heuristic is an order of m slower than the pairwise grouping heuristic, in that it requires $\binom{m-2i+2}{3}$ attractivities to be compared in step i.

4- The binary degrouping heuristic - This heuristic resembles a logarithmic search. It starts with the one-file partition and tries to divide it into two blocks such that the performance cost improves. After this is accomplished, the heuristic tries to divide each of the blocks into a pair of smaller blocks, and so on. Dividing a block is accomplished by finding all attributes that, if excluded from the block, result in an improvement of the performance cost (i.e. finding all the attributes in the block that are unattractive to it). A maximum of half of these unattractive attributes are then placed in a separate block of their own. Although this heuristic is very fast, in that it is logarithmic, it performed rather poorly, and tended to produce a trivial partition at the end. One may speculate that this heuristic can be useful for a usage pattern where the optimal partition consists of 2 or 3 blocks.

5- The single attribute degrouping heuristic - This heuristic, along with all the remaining heuristics, is a degrouping heuristic. A degrouping heuristic takes a partition and separates out one or more attributes from one block of the partition. In terms of the lattice model of stepwise minimization, the degrouping heuristics amount to going up an edge, from a partition in one row of the lattice to another partition in the row directly above. Except for heuristic 9, all the degrouping heuristics are intended to be used to improve upon the partition that has previously been arrived at by the process of pairwise grouping. By using the pairwise grouping heuristic and the degrouping heuristics alternately, we try to further optimize the resultant partition. When the pairwise grouping heuristic and the degrouping heuristics are applied alternately, the process of grouping and degrouping blocks may be seen

improvement. Again, experiments produced negative results, and in none of the cases tried did it improve on the partition produced by the pairwise grouping heuristic.

7- The single attribute degrouping-regrouping heuristic - This heuristic eliminates the drawbacks of the previous degrouping heuristics by degrouping an attribute from its block and simultaneously grouping it with another block, all in one step. This strategy is based on the reasoning that an attribute which has been grouped fairly early in the pairwise grouping process must have had a high attractivity in the first place. It is unlikely that this attribute should exist by itself in a separate block. More likely, in the later steps of pairwise grouping, it becomes more attractive to other blocks than to its current block. Indeed, in the few cases where the pairwise grouping partition was not optimal (as compared to the optimal partition found by exhaustive enumeration), the optimal partition could have been obtained from the pairwise grouping partition by transferring one or more attributes from one block to another. In terms of the lattice of attribute partitions, this amounts to a traversal from a partition of one row to another partition of the same row by going up an edge and coming back down another edge.

The single attribute degrouping-regrouping heuristic first selects the attributes that reside in blocks of two or more attributes. For each such attribute, it computes the attractivity measure of the attribute with all the blocks in the partition except for the block that the attribute currently resides in. It then selects the attribute that is most attractive to some other block and groups it in that block. If this improves the performance cost over the performance cost of the current partition, the heuristic iterates on the improved partition. In the cases where the pairwise grouping partition did not coincide with the optimal partition, the single attribute degrouping-regrouping heuristic, when applied to the resultant partition of the pairwise grouping heuristic, resulted in the optimal partition. (Observe that if applied to the optimal partition, the single attribute degrouping-regrouping heuristic will immediately terminate and return the optimal partition.) It is hard to determine the single attribute degrouping-regrouping heuristic's search cost. In all the cases where the heuristic was applied, it did not require more than three steps to conclude.

The fast pairwise grouping heuristic in combination with the single attribute degrouping-regrouping heuristic has proven superior to all the other heuristics as a result of our investigations. Consequently, we have selected the fast pairwise grouping heuristic in conjunction with the single attribute degrouping-regrouping heuristic as the choice of partitioning heuristics for our attribute partitioning system.
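The degrouping-regrouping step can be sketched as below. As an assumption, the sketch ranks candidate moves directly by the cost of the partition they produce, which is equivalent to ranking them by cost reduction; `toy_cost` is again a hypothetical stand-in for the file cost estimator and the function names are ours.

```python
def toy_cost(partition, queries, overhead=2):
    """Hypothetical cost: each block touched by a query costs a fixed
    overhead plus the number of attributes stored in that block."""
    return sum(freq * (overhead + len(block))
               for attrs, freq in queries
               for block in partition if block & attrs)


def best_single_move(partition, cost):
    """Try moving each attribute of a multi-attribute block into every
    other block; return the cheapest improving partition, or None."""
    best_partition, best_cost = None, cost(partition)
    for src in partition:
        if len(src) < 2:
            continue  # only attributes in blocks of two or more qualify
        for attr in src:
            for dst in partition:
                if dst == src:
                    continue
                moved = [b for b in partition if b not in (src, dst)]
                moved += [src - {attr}, dst | {attr}]
                moved_cost = cost(moved)
                if moved_cost < best_cost:
                    best_partition, best_cost = moved, moved_cost
    return best_partition


def degroup_regroup(partition, cost):
    """Iterate single-attribute transfers until none improves the cost."""
    partition = [frozenset(b) for b in partition]
    steps = 0
    while (improved := best_single_move(partition, cost)) is not None:
        partition, steps = improved, steps + 1
    return partition, steps


# Assumed usage pattern: (requested attributes, frequency) pairs.
QUERIES = [({"a", "b"}, 10), ({"c"}, 5), ({"a", "b", "c"}, 1)]
cost = lambda p: toy_cost(p, QUERIES)
# A deliberately poor starting partition, for illustration only.
result, steps = degroup_regroup([{"a", "c"}, {"b"}, {"d"}], cost)
```

Applied to a partition that is already optimal under the cost function, the loop finds no improving move and stops immediately, matching the observation in the text.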


8- The double attribute de 


- to'the single attribute & 


. This heuristic has the advantage pees beaks —or Le’ 
that: degrouping-regrouping ‘any. one of two attributes in one mp may not result in an 


improved: partition, witlle de rouping-+ ee dei war 


produce an inproyeeent. Tn none of our experiment id ‘we run ito such 2 situation where 


ss £ At ehh Efe o@ vs Sheer pags 5: 
double attribute degrouping-regroaping produced an “improvement while single ‘attribute 


Ae UBB Rm E SY UP op gtr eae see Eee 


degrouping regrouping did not ot produce any oe Also, this heuristic saffers from 


Pye “te ae beh wage ED oh tale 
being too coarse, as was “the | “case with, “the triptewise "grouping heuristic. if 


Beet ect PRELES VERSED oo 


deerme ease jos a single attribute ist heuristic “wilh. not detect that. 


bps ad: ieee warner go: hale see 2s 


in faver of the singte 


agls s Yraapg hep eee 0 27 spereye 1P, cee 73 ayn 


attribute degrouping- regrouping heuristic. ona 


Hence we have dropped this. ‘heuristic from 
9- The last heuristic that we developed and tested is the single attribute ungrouping
heuristic. This heuristic is a degrouping heuristic, in that it attempts to degroup attributes
from their blocks. The single attribute ungrouping heuristic performs the functions of both
the single attribute degrouping-regrouping heuristic and the single attribute degrouping
heuristic in one step, and it is thus superior to both of them. When an attribute is degrouped
from a block in a step, it may be regrouped back with other blocks, or placed in its own
block. In terms of the lattice of attribute partitions, the partitions considered are (some
of the) partitions reachable from the last partition by an upward edge (to directly one row
above, i.e., degrouping), and all partitions that belong to the same row and are reachable by
following an edge up and another edge down (degrouping-regrouping).

We mentioned previously that the single attribute degrouping heuristic does not turn
out to be advantageous if applied to the resulting partition of the pairwise grouping
heuristic. Hence, if applied as a sequel to the pairwise grouping heuristic, the single
attribute ungrouping heuristic will only be as good as the single attribute
degrouping-regrouping heuristic. The single attribute ungrouping heuristic is thus meant as a
stand-alone heuristic: it is initially applied to the one-file partition; the minimization
process moves upward from the bottom of the lattice until the optimal partition or a
near-optimal partition is reached. In this sense, the single attribute ungrouping heuristic is
the inverse process of the combination of the pairwise grouping heuristic with the single
attribute degrouping-regrouping heuristic.

Experimentation with this heuristic has generally been satisfactory. In the smaller
attribute partitioning problems (in terms of the number of attributes), the single attribute
ungrouping heuristic has been able to locate the optimal partition. In the larger problems,
the heuristic did not perform as well as the pairwise grouping heuristic, and resulted in a
less optimal partition.


The drawback of the single attribute ungrouping heuristic [...] of the heuristic, all the
attributes in [...]; at each step these attributes have to be regrouped with all of [...]
to assess the order of search cost for this heuristic, [...] partition evaluations as the
pairwise grouping and the single attribute degrouping-regrouping heuristics combined. For
most partitioning problems considered, the optimal partition was much closer, both in terms
of distance in the lattice of partitions and in terms of performance cost, to the trivial
partition than to the one-file partition. I.e., the distance [...] is much smaller to the top
of the lattice than to the bottom. This is an additional advantage that a stepwise
minimization heuristic that starts from the trivial partition has over a heuristic that
starts from the one-file partition.


The effectiveness of our approach to attribute partitioning in a self-organizing
database system can be fully evaluated only by employing it over an extended period of time
in an operational setting. However, it is possible to obtain some partial measurements of
these techniques by means of experimentation in a controlled environment.

We have conducted an extensive program of experimentation with the pairwise
grouping heuristic, the fast pairwise grouping heuristic, and the single attribute
degrouping-regrouping heuristic. We have also performed a number of tests on the other
attribute partitioning heuristics. The overall result for each heuristic was reported in the
discussion of that heuristic. In general, none of the other attribute partitioning heuristics
performed as well as the (fast) pairwise grouping heuristic, and none of the degrouping
heuristics performed as well as the single attribute degrouping-regrouping heuristic.
Henceforth we will concentrate on providing a detailed description of the experimentation
performed on our main heuristics and provide an assessment of the performance of these
heuristics in comparison with the exhaustive enumeration procedure. The exhaustive
enumeration procedure performs an exhaustive search of the full space of possible partitions
and so is guaranteed to locate the true optimal partition.
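As an illustration only, the exhaustive enumeration procedure can be sketched in Python as
follows. The sketch generates every partition of a set of attributes and keeps the cheapest
one. The function names are hypothetical, and toy_cost is a made-up stand-in for the file
cost estimator (it charges each query for the width of every block it touches, plus one unit
of overhead per block); it is an assumption for demonstration, not the estimator used in this
work.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i, block in enumerate(partial):
            yield partial[:i] + [[first] + block] + partial[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partial


def toy_cost(partition):
    """Hypothetical cost: each query pays (block width + 1) per block it touches."""
    queries = [({1, 2}, 10), ({3, 4}, 10)]  # (attributes used, frequency)
    return sum(freq * sum(len(block) + 1 for block in partition if set(block) & attrs)
               for attrs, freq in queries)


def exhaustive_optimum(attributes, cost):
    """Return a minimum-cost partition by trying all of them."""
    return min(set_partitions(attributes), key=cost)


best = exhaustive_optimum([1, 2, 3, 4], toy_cost)
print(sorted(sorted(block) for block in best))
```

Because every partition is visited, the search is only practical for small numbers of
attributes, which is the limitation discussed in the experiments.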


Each experiment was concerned with selecting an optimal partition for a particular
database in the context of some given usage pattern. In each experiment, the pairwise
grouping heuristic and the fast pairwise grouping heuristic were tried starting with the
trivial partition. Thereafter the single attribute degrouping-regrouping heuristic was tried
starting with the resultant partition of the pairwise grouping heuristics (i.e., the pairwise
grouping heuristic and the fast pairwise grouping heuristic). If the single attribute
degrouping-regrouping heuristic was relevant at that point, then the pairwise grouping
heuristics were retried starting with the resultant partition of the single attribute
degrouping-regrouping heuristic. In addition, using the exhaustive enumeration procedure,
the optimal partition was found for the same set of database parameters and the same
database usage pattern.
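The alternation just described can be sketched as follows. This is a simplified
illustration under stated assumptions, not the implementation used in the experiments: a
partition is a frozenset of blocks, each step moves to the cheapest neighbouring partition
if it is an improvement, and toy_cost is a hypothetical stand-in for the file cost
estimator. All function names are invented for the example.

```python
from itertools import combinations


def toy_cost(partition):
    """Hypothetical cost: each query pays (block width + 1) per block it touches."""
    queries = [({1, 2}, 10), ({3, 4}, 10)]  # (attributes used, frequency)
    return sum(freq * sum(len(block) + 1 for block in partition if block & attrs)
               for attrs, freq in queries)


def grouping_neighbors(partition):
    """Partitions obtained by merging two blocks (one row down the lattice)."""
    for a, b in combinations(partition, 2):
        yield (partition - {a, b}) | {a | b}


def regrouping_neighbors(partition):
    """Partitions in the same row: one attribute leaves its block and joins another."""
    for src in partition:
        if len(src) < 2:
            continue  # moving a lone attribute would be a grouping, not a regrouping
        for attr in src:
            for dst in partition - {src}:
                yield (partition - {src, dst}) | {src - {attr}, dst | {attr}}


def descend(partition, neighbors_of, cost):
    """Stepwise minimization; returns the final partition and the number of improving steps."""
    steps = 0
    while True:
        best = min(neighbors_of(partition), key=cost, default=None)
        if best is None or cost(best) >= cost(partition):
            return partition, steps
        partition, steps = best, steps + 1


def alternate(attributes, cost):
    """Pairwise grouping from the trivial partition, then alternate with regrouping."""
    partition = frozenset(frozenset([a]) for a in attributes)
    partition, _ = descend(partition, grouping_neighbors, cost)
    while True:
        partition, moved = descend(partition, regrouping_neighbors, cost)
        if moved == 0:  # the degrouping-regrouping heuristic was not relevant
            return partition
        partition, _ = descend(partition, grouping_neighbors, cost)


print(sorted(sorted(block) for block in alternate([1, 2, 3, 4], toy_cost)))
```

Each call to descend strictly lowers the cost, so the alternation terminates.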

In an operational setting, the usage pattern input to an attribute partitioning
system would be based on historical records of database use; in our experiments, we generated
these patterns ourselves. Naturally, a full program of experimentation would entail
performing experiments for the full range of data characteristics and usage patterns; since
this was obviously infeasible, we had to restrict ourselves to a limited set of cases.
Selecting these cases posed a real dilemma. We did not want to choose them at random, for we
might thereby focus on uninteresting or unrepresentative examples; but if we selected them too
carefully, we would potentially sacrifice some of the validity and generality of our testing
procedure. In consequence, we settled on the following approach. For each experiment, we
provided a set of database parameters and a set of other parameters that roughly and
succinctly characterized a particular kind of usage pattern. These succinct usage pattern
parameters were then used to drive a usage pattern generator, probabilistic in nature; it
randomly generated a set of transactions in accordance with the given parameters
and then summarized them. The usage pattern thus generated was
then used as the usage pattern for the experiment.
In all our experiments we have kept two of the database parameters fixed: n, the
number of tuples in the file, was set to 100,000 and S, the page size, was set to 1024
words. We believe these figures reflect typical values for real databases; in any event, the
behaviour of our attribute partitioning system should not differ radically for problems with
different values for n and S. Also, we are permitted to constrain at least one database
parameter without reducing the degrees of freedom inherent in our problem.

The results of our experimentation are presented in Tables 1 through 9. Our most
extensive series of experiments were conducted with files of 8 and 22 attributes. The limiting
factor in our experimentation was the use of the exhaustive enumeration procedure. Even
for a file with 8 attributes, the number of possible partitions in the search space is 4140,
and using the exhaustive enumeration procedure to find the optimal partition for the problem
is a nontrivial task, one that can only be accomplished with difficulty. For problems with
m > 8, we have not been able to use the exhaustive enumeration procedure to verify the
performance of our attribute partitioning heuristics. However, we can reach a number of
conclusions concerning the behaviour of the heuristics for problems with m > 8 by
extending the results obtained from problems with m ≤ 8.
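The size of the search space is the number of ways of partitioning a set of m attributes
into blocks, which is the Bell number of m. The following short sketch (the function name
is ours, chosen for illustration) computes these numbers with the Bell triangle and shows
how quickly exhaustive enumeration becomes infeasible.

```python
def bell_numbers(limit):
    """Return [B(0), B(1), ..., B(limit)] computed with the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(limit):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left in the triangle.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(30)
for m in (5, 6, 7, 8, 15, 22, 30):
    print(m, bells[m])
```

For m = 8 this gives the 4140 partitions quoted above; for m = 15 the count is already
above 10^9.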

Table 1 shows the characteristics of our experimentation. Each row shows the
number of experiments carried out for an attribute partitioning problem with 5, 6, 7, 8, 15,
22, or 30 attributes. The average number of conjunctive query types and the average number of
disjunctive query types for each row are depicted in the third and fourth columns of the
table. The last column of Table 1 shows the relative cost of the trivial partition with
respect to the one-file (unpartitioned) performance cost. That is, for partitioning problems
with a certain m, the ratio of the cost of the trivial partition to the cost of the
unpartitioned file was computed, and the last column of Table 1 shows the average of such
ratios for each m. In the remaining tables, all columns superscripted by '1' give the
average of the ratios with respect to the one-file partition.


Number of    Number of     Average number   Average number   [illegible]   Average trivial
attributes   experiments   of conjunctive   of disjunctive                 partition cost¹
                           query types      query types

     5          10           [illeg.]         [illeg.]        [illeg.]        0.708
     6          10              8                4              1.0          [illeg.]
     7        [illeg.]       [illeg.]         [illeg.]        [illeg.]       [illeg.]
     8          20           [illeg.]         [illeg.]          1.0           0.499
    15          10           [illeg.]         [illeg.]        [illeg.]        0.189
    22          20             70               12            [illeg.]       [illeg.]
    30           6              6               25            [illeg.]       [illeg.]


Table 1 Characteristics of the experiments.


Table 2 shows the result of experiments with m = 5. The optimal partition found
by exhaustive enumeration and the partition found by the pairwise grouping heuristic and
the fast pairwise grouping heuristic coincided in all of the ten experiments (as indicated in
[...]). [...] partition was 0.59. The heuristics iterated on the average for [...] steps.
This includes the last step, where the heuristics could not improve upon the last partition
and returned the partition found by the previous step.


Number of attributes: 5
Number of experiments: 10


                                    Average cost of best   Number of steps   [illegible]
                                    partition found

Exhaustive enumeration                    0.592               [illeg.]        [illeg.]
Pairwise grouping heuristic               0.592                 2.5              0
Fast pairwise grouping heuristic          0.592               [illeg.]        [illeg.]


Table 2 Result of experiments with m = 5.


Chapter 5    The Attribute Partitioning Heuristics


Tables 3 and 4 depict the results of experiments with m = 6 and 7. Again in all the
cases [...] fast pairwise grouping heuristic.

Tables 5a and 5b show the results of experiments with m = 8. As indicated
in Table 5a, the pairwise grouping heuristic and the fast pairwise grouping heuristic found

Number of attributes: 6
Number of experiments: 10


                                    Average cost of best   Number of steps   Number of times optimal [...]
                                    partition found

Exhaustive enumeration                    0.434               [illeg.]        [illeg.]
Pairwise grouping heuristic              [illeg.]             [illeg.]        [illeg.]
Fast pairwise grouping heuristic          0.434               [illeg.]           0


Table 3 Result of experiments with m = 6.


Number of attributes: 7
Number of experiments: 14


                                    Average cost of best   Number of steps   Number of times optimal [...]
                                    partition found

Exhaustive enumeration                   [illeg.]             [illeg.]        [illeg.]
Pairwise grouping heuristic              [illeg.]             [illeg.]        [illeg.]
Fast pairwise grouping heuristic         [illeg.]             [illeg.]        [illeg.]


Table 4 Result of experiments with m = 7.


Number of attributes: 8
Number of experiments: 20


                                    Average cost of best   Number of steps   Number of times optimal [...]
                                    partition found

Exhaustive enumeration                    0.4657              [illeg.]        [illeg.]
Pairwise grouping heuristic              [illeg.]             [illeg.]        [illeg.]
Fast pairwise grouping heuristic          0.4657              [illeg.]        [illeg.]


Table 5a Result of experiments with m = 8.


Number of attributes: 8
Number of experiments: 1


                                    Average cost of best   [illegible]
                                    partition found

Single attribute
degrouping-regrouping heuristic          [illeg.]             [illeg.]


Table 5b Result of the single attribute degrouping-regrouping heuristic with m = 8.


the optimal partition in all but one case. Table 5b shows the result of applying the single
attribute degrouping-regrouping heuristic to the resultant partition of the pairwise grouping
heuristics in the one problem where the pairwise grouping heuristics did not find the optimal
partition. The single attribute degrouping-regrouping heuristic found the optimal partition
in 2 steps (where the second step did not find an improvement to the result of the first step).
The ratio of the cost of the resultant partition of the single attribute degrouping-regrouping
heuristic to that of the pairwise grouping heuristics was 0.9999, i.e., the local minimum
found by the pairwise grouping heuristics was within 0.0001 of the optimal solution. (A column
superscripted with '2' gives the performance cost with respect to the performance cost of the
partition found by the pairwise grouping heuristics in the previous table.)


Table 6 shows the results for problems with m = 15. In this series of experiments
and in all the experiments with larger m, the exhaustive enumeration procedure has become
infeasible (for m = 15, the number of possible partitions is approximately 1.38 × 10^9).
However, in all the experiments with m = 15, the pairwise grouping heuristic and the fast
pairwise grouping heuristic coincided in their resultant partition. The average cost of the
best partition found by the heuristics was 0.15 of the cost of the one-file partition for this
set of experiments. Also, none of the degrouping heuristics were relevant at this stage.

Tables 7a and 7b show the result of 20 experiments with m = 22. In half of the
experiments, it was possible to improve upon the result of the pairwise grouping heuristic
(which always coincided with the result of the fast pairwise grouping heuristic). The single
attribute degrouping-regrouping heuristic was applied in these 10 cases, and the partition


Number of attributes: 15
Number of experiments: 10


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Exhaustive enumeration                      -                    -                -
Pairwise grouping heuristic               0.1504                8.3               0
Fast pairwise grouping heuristic          0.1501              [illeg.]         [illeg.]


Table 6 Result of experiments with m = 15.


Number of attributes: 22
Number of experiments: 20


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Exhaustive enumeration                      -                    -                -
Pairwise grouping heuristic               0.1010               16.8              10
Fast pairwise grouping heuristic          0.1010              [illeg.]            0


Table 7a Result of experiments with m = 22.


Number of attributes: 22
Number of experiments: 10


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Single attribute
degrouping-regrouping heuristic           0.9973              [illeg.]         [illeg.]


Table 7b Result of the single attribute degrouping-regrouping heuristic with m = 22.


reached by the single attribute degrouping-regrouping heuristic was one that no
other (pairwise grouping or degrouping) heuristic could improve upon. The single
attribute degrouping-regrouping heuristic iterated for an average of 2.6 steps. The
performance cost of the partition found using the single attribute degrouping-regrouping
heuristic was on the average 0.0027 less than that found by the pairwise grouping heuristics.

Tables 8a, 8b, and 8c show the result of 6 experiments with m = 30. In 5 of the
experiments, the single attribute degrouping-regrouping heuristic was able to improve an
Number of attributes: 30
Number of experiments: 6


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Exhaustive enumeration                      -                    -                -
Pairwise grouping heuristic               0.30570              21.2            [illeg.]
Fast pairwise grouping heuristic          0.30575              21.0               5


Table 8a Result of experiments with m = 30.


Number of attributes: 30
Number of experiments: 5


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Single attribute
degrouping-regrouping heuristic           0.9929                2.3               2


Table 8b Result of the single attribute degrouping-regrouping heuristic with m = 30.


Number of attributes: 30
Number of experiments: [illeg.]


                                    Average cost of best   Number of steps   Number of times a better [...]
                                    partition found

Pairwise grouping heuristic               0.9995                2.0               0
Fast pairwise grouping heuristic          0.9995                2.0            [illeg.]


Table 8c Result of reapplying the pairwise grouping heuristic with m = 30.
average of 0.0071 on the partition produced by the pairwise grouping heuristics. In 2 out of
the 5 cases where the single attribute degrouping-regrouping heuristic was relevant, the
pairwise grouping heuristics, when reapplied starting with the final partition of the single
attribute degrouping-regrouping heuristic, produced a partition with an average further
improvement of 0.0005 in performance cost. No other heuristic could improve on the partition
found by the reapplication of the pairwise grouping heuristics. The column superscripted by
'3' in Table 8c indicates that the figures in the column are performance costs with respect
to the partition found by the single attribute degrouping-regrouping heuristic in Table 8b.

Table 9 depicts the execution time statistics. For all of the heuristics considered, the
number of partitions evaluated increased as the number of attributes of the file increased.
The exhaustive enumeration procedure had the highest rate of increase, while the pairwise
grouping heuristic had a drastically lower rate. The fast pairwise grouping heuristic had an
even lower rate of increase than the pairwise grouping heuristic in terms of the partitions
required for evaluation. The figures show that the number of partition evaluations for the
exhaustive enumeration procedure is approximately on the order of m^m, for the pairwise
grouping heuristic on the order of m^3, and for the fast pairwise grouping heuristic on the
order of m^2. With m = 30, the fast pairwise grouping heuristic required roughly 1/5 the
number of partition evaluations required by the exhaustive enumeration procedure with
m = 8. This is a significant improvement. Note that the processing time required by a
heuristic depends both on the number of partitions evaluated and also on the number of query
types in the usage pattern. For example, the processing time for the fast pairwise grouping
heuristic is on the order of the product of the number of query types in the usage pattern
and the square of the number of attributes in the file.


Exhaustive enumeration:

    Number of attributes    Number of partitions

             5                      52
             6                      203
             7                      877
             8                      4140
            15                      ≈ 1.38 × 10^9
            22                      ≈ 4.50 × 10^15
            30                      ≈ 8.46 × 10^23


Pairwise grouping heuristic:

    [entries illegible]


Fast pairwise grouping heuristic:

    Average number of partitions evaluated (legible entries, in source order):
        16, [illeg.], 31, 175, 791

    Processing time (legible entries, in source order; row alignment not recoverable):
        68, 285, 268, 2212


Table 9 Execution time statistics.
The result of these experiments may be summarized as follows. For problems where
m ≤ 8, both the pairwise grouping heuristic and the fast pairwise grouping heuristic found
the optimal partition in all but one case. The optimal partition found was significantly
superior to the one-file partition in all of the cases tested. In the one case where the
pairwise grouping heuristics did not find the optimal partition, the partition that they found
was within 0.0001 of the optimal partition. In this case, the single attribute
degrouping-regrouping heuristic, when starting with the final partition of the pairwise
grouping heuristics, found the optimal partition by the transfer of only a single attribute
from one subfile to another. Altogether, for m ≤ 8, where the optimal partition could be
determined by exhaustive enumeration, the combination of a pairwise grouping heuristic and the
single attribute degrouping-regrouping heuristic was always able to find the optimal
partition. For problems where m > 8, it was not possible to directly verify the outcome of
the pairwise grouping heuristics by exhaustive enumeration. However, in most of the problems
tried, the pairwise grouping heuristic found a partition that no other degrouping heuristic
could improve upon. In addition, the partition found by the pairwise grouping heuristics was
always far superior to the one-file partition. In less than half the experiments with
m > 8, the single attribute degrouping-regrouping heuristic improved upon the resultant
partition of the pairwise grouping heuristic, and in these cases, the improved partition
differed insignificantly from the resultant partition of the pairwise grouping heuristics,
attesting to the desirability and the near-optimality of the pairwise grouping heuristics. In
the very few cases where the pairwise grouping heuristics were relevant to the partition
found by the single attribute degrouping-regrouping heuristic, the corresponding
improvement in the performance cost was negligible. All this indicates that the combination
of a pairwise grouping heuristic and the single attribute degrouping-regrouping heuristic
(when applied alternately) converges rapidly to a desirable solution. If we can extend the
results for cases where m ≤ 8 to the cases where m > 8, we can conclude that the optimal
partition does not differ at all, or does not differ significantly in performance cost, from
the resultant partition of the combination of a pairwise grouping heuristic and the single
attribute degrouping-regrouping heuristic.

A number of observations were also made during the experimentation concerning
the behaviour of the attribute partitioning system. We comment on the following:


1- A number of experiments were performed with database parameters and usage patterns
that were variations of the parameters of an existing experiment. That is, we took the
database parameters and the usage pattern of an experiment, and we changed the database
parameters such as the set of available indices and the attribute lengths and selectivities.
(The number of attributes, the number of tuples, and the page size were kept constant.)
We then applied the attribute partitioning heuristics (and the exhaustive enumeration
procedure, when possible) to these variations to observe how the optimal partition
differed from one problem to another. Changing the database parameters
such as the attribute lengths and selectivities, sometimes even radically, did not
significantly alter the optimal partition: at most one or two attributes moved from one
subfile to another. However, changing the usage pattern, such as the frequency of query
types or the composition of the [...] of the query types, sometimes drastically
altered the optimal partition. Hence it appears that in our environment, the optimal
partition is more sensitive to the usage pattern than to the database
parameters, and thus that the usage pattern is more a determinant of the optimal
partition than the other parameters.

2- It was observed that the larger the number of available indices for an attribute
partitioning problem, the larger the subfiles of the optimal partition. That is, more
attributes were clustered together in the same subfile, and the smaller the number of
subfiles in the optimal partition. In addition, the performance cost of the optimal
partition (and also that of the trivial partition) became closer to the performance cost of
the one-file partition as the number of available indices was increased. We provide the
following as an explanation for this observation: if an attribute is indexed, the index will
most likely be used to resolve that attribute when the attribute is requested by the
selection component of a query. If the attribute is not indexed, then the attribute has to be
resolved by either sequentially searching or by linking, and any other attribute that exists
with this attribute in its subfile will also have to be retrieved in its entirety. This will
incur extra page accesses. Even if the other co-existing attribute was requested in the same
component of the query, it is always easier to link to the co-existing attribute (from a
reduced TID list) in order to resolve it rather than sequentially search the co-existing
attribute in its entirety or link to it from a larger TID list. Therefore, having an index
available on an attribute will in most cases eliminate the need for its sequential searching
or linking to the subfile containing the attribute, and hence increase the overall
attractivity of the attribute to the other attributes in the file. Therefore, indices
contribute to the clustering of attributes in the file. This point will again be taken up in
Section 6.2 when we discuss


the relative efficacy of the trivial partition.

3- In Table 8a we see that the average cost of the partition found by the pairwise grouping
heuristic is not exactly the same as that for the fast pairwise grouping heuristic. It turns
out that for one of the experiments with m = 30, the outcomes of the two pairwise
grouping heuristics did not coincide. Although it is not impossible that the results of the
two heuristics be different, in Section 5.3 we argued that this is highly improbable.
However, it appears that this becomes more probable as m increases, because the
heuristics have to iterate for more steps and the pairwise attractivity measures computed
by the fast pairwise grouping heuristic in the first step become less likely to hold as the
heuristic iterates for a large number of steps. This was the case in the one problem
mentioned above. I.e., a pair of attributes, neither of which participated in a grouping,
turned out to be unattractive in the latter steps of the pairwise grouping heuristic.
Unfortunately it is precisely for these problems with large m that the fast pairwise
grouping heuristic is most needed.


We have developed a third pairwise grouping heuristic to remedy this discrepancy.
This heuristic (we may call it the general pairwise grouping heuristic) is a combination
of both the pairwise grouping heuristic and the fast pairwise grouping heuristic. The
general pairwise grouping heuristic starts with the trivial partition, and attempts pairwise
grouping in the same way as the fast pairwise grouping heuristic does, but only for a
limited number of steps, say k1 steps. (The fast pairwise grouping heuristic iterated for
an unbounded number of steps until it could not produce an improvement to the last
partition found.) The partition that results after k1 steps is then taken and the fast
pairwise grouping heuristic is then reapplied to this partition, but only for k2 steps.
This process is repeated (each time for a limited number of steps) until at some point the
fast pairwise grouping heuristic produces a partition that it cannot improve upon.
Each time the fast pairwise grouping heuristic is reapplied, all pairwise attractivity
measures of blocks are recomputed; even pairs of blocks that did not participate in a
grouping since the last application of the fast pairwise grouping heuristic will have their
attractivity measures recomputed. In this manner, if the ki are chosen such that they
are comparatively smaller than the number of steps the pairwise grouping heuristic is
expected to iterate if applied to the same problem, then discrepancies like the one
encountered in the above problem will, in general, be eliminated. The ki are most
effectively chosen if k1 ≥ k2 ≥ ... ≥ 1, because as the process of pairwise grouping
approaches the optimal partition, pairwise block attractivity measures become smaller,
and there is a greater chance that a pair of blocks may be attractive in one step, but
unattractive in another step. For example, in the problems with m = 30, we chose k1 = 6,
k2 = 4, k3 = 2, and k4 = k5 = ... = 1. The reason that the general pairwise grouping
heuristic is a combination of the two other pairwise grouping heuristics is that if
k1 = k2 = ... = 1, then the general pairwise grouping heuristic will be the same as the
pairwise grouping heuristic, while if k1 is taken to be unbounded, then the general
pairwise grouping heuristic will become the fast pairwise grouping heuristic.
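A minimal sketch of the general pairwise grouping heuristic follows, under explicit
assumptions. We assume that the attractivity of a pair of blocks is the cost reduction
obtained by merging them, that within one pass the measures of pairs untouched by a
grouping are reused rather than recomputed, and that toy_cost (a hypothetical stand-in)
plays the role of the file cost estimator. The function names and the representation of a
partition as a frozenset of blocks are illustrative choices, not taken from this work.

```python
from itertools import combinations


def toy_cost(partition):
    """Hypothetical cost: each query pays (block width + 1) per block it touches."""
    queries = [({1, 2}, 10), ({3, 4}, 10)]  # (attributes used, frequency)
    return sum(freq * sum(len(block) + 1 for block in partition if block & attrs)
               for attrs, freq in queries)


def merge(partition, a, b):
    """Partition obtained by grouping blocks a and b."""
    return (partition - {a, b}) | {a | b}


def fast_pass(partition, cost, max_steps):
    """At most max_steps groupings; attractivity measures are cached between steps."""
    current = cost(partition)
    gains = {(a, b): current - cost(merge(partition, a, b))
             for a, b in combinations(partition, 2)}
    steps = 0
    while steps < max_steps and gains:
        (a, b), gain = max(gains.items(), key=lambda item: item[1])
        if gain <= 0:
            break
        partition = merge(partition, a, b)
        current = cost(partition)
        merged = a | b
        # Measures of pairs that did not take part in the grouping are kept as they
        # are (and may go stale); only pairs involving the new block are evaluated.
        gains = {pair: g for pair, g in gains.items()
                 if a not in pair and b not in pair}
        for other in partition - {merged}:
            gains[(merged, other)] = current - cost(merge(partition, merged, other))
        steps += 1
    return partition, steps


def general_pairwise_grouping(attributes, cost, schedule):
    """Apply fast passes of k1, k2, ... steps (then 1 step each) until none improves."""
    partition = frozenset(frozenset([a]) for a in attributes)  # trivial partition
    limits = iter(schedule)
    while True:
        k = next(limits, 1)  # every pass starts by recomputing all measures
        partition, steps = fast_pass(partition, cost, k)
        if steps == 0:
            return partition


result = general_pairwise_grouping([1, 2, 3, 4], toy_cost, [6, 4, 2])
print(sorted(sorted(block) for block in result))
```

An empty schedule makes every pass a single step, which corresponds to recomputing all
measures at each step as the pairwise grouping heuristic does; a very large first entry
corresponds to the fast pairwise grouping heuristic.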
We have tried the general pairwise grouping heuristic on all of the six problems
with m = 30. It always found the same partition as the partition that the pairwise
grouping heuristic found, and on the average the general pairwise grouping heuristic
required the evaluation of 1168 partitions and took 1596 seconds of processing time to
accomplish that. This compares favorably with the pairwise grouping heuristic, where
the respective figures are 3964 partitions and [illegible] seconds. The processing time
depends on how the parameters k1, k2, k3, ... are selected. The larger that they are chosen
to be, the faster the heuristic will be, but the smaller the probability that the partition
found will coincide with the partition found by the pairwise grouping heuristic.

In the next chapter we will describe in detail one of the experiments that we have
performed. We will provide the database parameters and the usage pattern parameters, and we
will show the sequence of partitions found by the pairwise grouping heuristic. The problem
we have chosen to elaborate in Chapter 6 is the one problem with m = 8 for which the
pairwise grouping heuristic did not find the optimal partition.
CHAPTER 6


AN EXAMPLE OF ATTRIBUTE PARTITIONING


We have presented the combination of the fast pairwise grouping heuristic and the
single attribute degrouping-regrouping heuristic as the main heuristic technique of our
attribute partitioning system. We have also presented a number of other attribute
partitioning heuristics. In Section 7 of Chapter 5 we reported on the result of a course of
experimentation with the pairwise grouping heuristic, the fast pairwise grouping heuristic,
and the single attribute degrouping-regrouping heuristic. We applied these three heuristics
to a number of usage pattern histories in the context of different database environments. We
also performed a number of experiments on the other attribute partitioning heuristics, and we
reported the overall results of the experiments in the previous chapter. In summary, none of
the other heuristics performed as well as the fast pairwise grouping heuristic; the fast
pairwise grouping heuristic generally produced the same partition as the pairwise grouping
heuristic, while requiring only a fraction of its time. In all the cases that we tested, the
fast pairwise grouping heuristic consistently produced either the optimal partition (as
determined by exhaustive enumeration) or else a near-optimal partition that differed from the
optimal partition by less than one percent. In those cases where the resultant partition of
the fast pairwise grouping heuristic was nonoptimal, by using the single attribute
degrouping-regrouping heuristic to improve upon the resultant partition, we were able to
obtain the optimal partition in at most three steps.




In order that the reader develop a feel for the form and magnitude of the
partitioning problems that we have considered and the heuristics used to search for their
solutions, we present an example of an attribute partitioning problem in this section along
with the solutions obtained by applying a number of the attribute partitioning heuristics. We
have included only 8 attributes in the relation of this example (for which there are 4140
possible partitions), so that we may arrive at the optimal partition by exhaustive enumeration
of all possibilities. The attribute partitioning problem considered here is typical of the
problems we have solved in our work. It is also an example of a problem where the single
attribute degrouping-regrouping heuristic is relevant.
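The figure of 4140 partitions for 8 attributes is the eighth Bell number. The following is a minimal sketch, not the report's MDL enumeration procedure, that counts and enumerates the partitions of a small attribute set; the function names `bell_number` and `set_partitions` are illustrative choices.

```python
def bell_number(m):
    """Number of ways to partition a set of m items, via the Bell triangle."""
    row = [1]
    for _ in range(m - 1):
        nxt = [row[-1]]              # each row starts with the last entry of the previous row
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[-1]


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # `first` either forms a block of its own ...
        yield [[first]] + partition
        # ... or joins one of the existing blocks.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]


attributes = range(1, 9)                       # the 8 attributes of the example
count = sum(1 for _ in set_partitions(attributes))
print(bell_number(8), count)
```

The count grows very quickly with the number of attributes (`bell_number(10)` is 115975 and `bell_number(20)` exceeds 5 x 10^13), which is why exhaustive enumeration is only practical for small relations such as this one.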

Figure 1 shows the database parameters (the number of attributes and tuples, the
attribute set, the indexed attributes, the page size, and the attribute lengths and
selectivities). Figure 2 shows the usage pattern query types as described in the query type
table. The query type table contains the frequency of each query type, the connectivity of the
predicates (conjunction or disjunction) in the selection component, the attributes in the
selection component, and the attributes in the projection component. The joint selectivity of
the selection component (although not actually included in the query type table) is also
depicted.

m = 8                           The number of attributes in the relation.
n = 100,000                     The number of tuples in the relation.
A = {1, 2, 3, 4, 5, 6, 7, 8}    The attributes of the relation.
{1, 3}                          The indexed attributes.
S = 1024 words                  The system page size.
7693 pages                      The number of pages in the relation.

Attribute lengths and selectivities:

Attribute i    Attribute length    Attribute selectivity
     1                2               [illegible]
     2           [illegible]          [illegible]
     3                6               [illegible]
     4                6               [illegible]
     5               10               [illegible]
     6               10               [illegible]
     7               20               [illegible]
     8               20               [illegible]

Figure 1 Database parameters.


[Query type table: for each query type, the frequency, the predicate connectivity (conj. in
every legible row), the attributes of the selection component, the attributes of the
projection component, and the joint selectivity. The individual entries are not recoverable.]

Figure 2 Query type table.


The results of applying the heuristics and the exhaustive enumeration procedure on
this particular example are shown in Figures 3 through 6. The total cost of processing the set
of queries, when the file is unpartitioned (i.e., the one-file partition), is 1881286 pages
(Figure 3). This processing cost was calculated by the file cost estimator in the manner
described in Section 4.4. Figure 4 shows the partition that results from applying the pairwise
grouping heuristic; the fast pairwise grouping heuristic produced an identical result. Both
heuristics iterated for 4 steps.

All the partitioning heuristics have been programmed in the programming language
MDL [26]. The compiled versions of the pairwise grouping heuristic and of the fast pairwise
grouping heuristic resulted in the partition P^3 in [...] and [...] seconds of CPU time
respectively, and required the file cost estimation of [...] and [...] partitions respectively.

As it turns out, P^3, though near optimal, is not optimal and differs from the
optimal partition by a negligible amount. The exhaustive enumeration procedure (also
programmed in MDL) found the optimal partition P* of Figure 5 after trying all 4140
possible partitions. The exhaustive enumeration required 2905 seconds of CPU
time to generate all possible partitions and to cost evaluate them in order to arrive at the
optimal partition; this is 2 to 3 orders of magnitude slower than the fast pairwise grouping
heuristic.
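The trace in Figure 4 suggests the general shape of the pairwise grouping heuristic: start from the trivial partition and, at each step, merge the pair of blocks that lowers the cost most, stopping when no merge improves it. The sketch below illustrates that greedy loop under stated assumptions only. The heuristic itself is defined in Chapter 5, and the cost function here is a hypothetical stand-in (a standard approximation of the expected number of pages touched when k tuples are fetched at random), not the file cost estimator of Section 4.4. The attribute lengths and the usage pattern are invented for illustration.

```python
import math
from itertools import combinations

N_TUPLES = 100_000
PAGE_SIZE = 1024                                   # words per page
LENGTHS = {1: 2, 2: 6, 3: 6, 4: 10, 5: 10}         # hypothetical attribute lengths
# (frequency, attributes accessed, tuples retrieved): a hypothetical usage pattern
QUERIES = [(0.4, {1, 4}, 50), (0.3, {4, 5}, 20),
           (0.2, {2}, 100_000), (0.1, {2, 3}, 500)]


def pages_touched(n_pages, k):
    """Expected pages touched when k tuples are fetched at random from n_pages."""
    return n_pages * (1.0 - (1.0 - 1.0 / n_pages) ** k)


def cost(partition):
    """Stand-in performance cost: expected page accesses over the usage pattern."""
    total = 0.0
    for freq, attrs, k in QUERIES:
        for block in partition:
            if attrs & block:                      # the query must read this subfile
                width = sum(LENGTHS[a] for a in block)
                n_pages = math.ceil(N_TUPLES / (PAGE_SIZE // width))
                total += freq * pages_touched(n_pages, k)
    return total


def pairwise_grouping(attributes):
    """Greedy merging from the trivial partition; returns the list of steps."""
    partition = [frozenset([a]) for a in attributes]
    best = cost(partition)
    steps = [(partition, best)]
    while len(partition) > 1:
        candidates = []
        for b1, b2 in combinations(partition, 2):
            merged = [b for b in partition if b not in (b1, b2)] + [b1 | b2]
            candidates.append((cost(merged), merged))
        c, p = min(candidates, key=lambda t: t[0])
        if c >= best:                              # "no improvement": stop
            break
        partition, best = p, c
        steps.append((partition, best))
    return steps


steps = pairwise_grouping(LENGTHS)
for number, (partition, c) in enumerate(steps):
    print(number, sorted(sorted(b) for b in partition), round(c, 1))
```

Each step evaluates every pair of blocks, so the number of cost evaluations grows roughly with the cube of the number of attributes in the worst case, compared with the Bell-number growth of exhaustive enumeration.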

C({{1, 2, 3, 4, 5, 6, 7, 8}}) = 1881286 pages

Figure 3 One-file partition performance cost.


Step    Partition                                            Performance cost in pages
 0      P^0 = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}}       C(P^0) = 2??004
 1      P^1 = {{1}, {2}, {3}, {4, 5}, {6}, {7}, {8}}         C(P^1) = 277870
 2      P^2 = {{1}, {2}, {3, 7}, {4, 5}, {6}, {8}}           C(P^2) = 272887
 3      P^3 = {{1, 4, 5}, {2}, {3, 7}, {6}, {8}}             C(P^3) = 271840
 4      no improvement

Figure 4 The results of the pairwise grouping heuristic
and the fast pairwise grouping heuristic.


Step    Partition                                            Performance cost in pages
 0      P^3 = {{1, 4, 5}, {2}, {3, 7}, {6}, {8}}             C(P^3) = 271880
 1      P*  = {{1, 4, 5}, {2}, {3, 8}, {6}, {7}}             C(P*)  = 271814
 2      no improvement

Figure 5 The result of the single attribute degrouping-regrouping heuristic.
P* is the optimal partition found by exhaustive enumeration.


Comparing P^3 of Figure 4 and P* of Figure 5, we see that in the near optimal
partition P^3, attribute 3 is grouped with attribute 7, while in the optimal partition P*,
attribute 3 is grouped with attribute 8. Hence when the single attribute
degrouping-regrouping heuristic was tried on P^3 (Figure 5), it produced the optimal
partition P* in two steps and after the evaluation of 2 partitions. (The second step being
the test to see if the heuristic could improve upon P*.) As may be observed from partitions
P^3 and P*, none of the other degrouping heuristics starting from P^3 could have resulted in
P*. The single attribute degrouping heuristic, the double attribute grouping heuristic, and the
double attribute degrouping-regrouping heuristic all produced partitions with higher
performance cost than P^3.
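The move from P^3 to P* in Figure 5 relocates a single attribute from one block to another. A minimal sketch of such a local search follows, assuming the heuristic tries every single-attribute move and keeps the best one while the cost improves; the precise definition is the one given in Chapter 5. The cost function, the attribute lengths (including the value for attribute 2), the fixed per-subfile charge and the usage pattern are all hypothetical, chosen so that moving attribute 3 next to attribute 8 lowers the cost.

```python
LENGTHS = {1: 2, 2: 2, 3: 6, 4: 6, 5: 10, 6: 10, 7: 20, 8: 20}   # hypothetical
OVERHEAD = 10                                                    # hypothetical per-subfile charge
# (frequency, attributes accessed): a hypothetical usage pattern
QUERIES = [(5, {1, 4, 5}), (3, {3, 8}), (1, {3, 7}), (2, {2}), (2, {6}), (1, {7})]


def cost(partition):
    """Toy cost: each touched subfile costs its width plus a fixed overhead."""
    return sum(freq * (sum(LENGTHS[a] for a in block) + OVERHEAD)
               for freq, attrs in QUERIES
               for block in partition if attrs & block)


def single_attribute_moves(partition):
    """Yield every partition obtained by moving one attribute elsewhere."""
    for src in partition:
        for attr in src:
            others = [b for b in partition if b is not src]
            leftover = [src - {attr}] if len(src) > 1 else []
            for dst in others:                       # regroup with another block
                rest = [b for b in others if b is not dst]
                yield rest + leftover + [dst | {attr}]
            if leftover:                             # or leave it in a block of its own
                yield others + leftover + [frozenset([attr])]


def improve(partition):
    """Apply the best single-attribute move until no move lowers the cost."""
    best, best_cost, evaluated = partition, cost(partition), 0
    while True:
        candidates = [(cost(p), p) for p in single_attribute_moves(best)]
        evaluated += len(candidates)
        if not candidates:
            return best, best_cost, evaluated
        c, p = min(candidates, key=lambda t: t[0])
        if c >= best_cost:
            return best, best_cost, evaluated
        best, best_cost = p, c


start = [frozenset(b) for b in ({1, 4, 5}, {2}, {3, 7}, {6}, {8})]
final, final_cost, evaluated = improve(start)
print(sorted(sorted(b) for b in final), final_cost, evaluated)
```

Because only single-attribute moves are examined, the search stops at a partition that no such move can improve, which need not be the global optimum in general.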

We also tried other grouping heuristics starting from P^0. The triplewise grouping
heuristic resulted in a partition far less optimal than P^3. The two most attractive pairwise
grouping heuristic found the optimal partition P* (although it has not always been the case
that this heuristic would find the optimal partition). Although the k-partition pairwise
grouping heuristic with k = 2 was able to find the optimal partition P*, it found the
optimal partition after cost evaluating 120 partitions, considerably more than the 67 partition
evaluations required by the combination of the fast pairwise grouping heuristic and the
single attribute degrouping-regrouping heuristic. For files with a large number of attributes,
the disparity in search cost (between these two alternatives) is even greater.


The single attribute ungrouping heuristic was able to produce the optimal partition
P* when started from the one-file partition. The converging sequence is shown in Figure 6.
This is in accord with our good experience with the single attribute ungrouping heuristic.
The single attribute ungrouping heuristic required a total of 2 seconds of CPU time and
terminated after considering 17 partitions.


Step    Partition                                            Performance cost in pages
 0      P^0 = {{1, 2, 3, 4, 5, 6, 7, 8}}                     C(P^0) = 1881286
 1      P^1 = {{1, 3, 4, 5, 6, 7, 8}, {2}}                   C(P^1) = [illegible]
 2      P^2 = {{1, 3, 4, 5, 7, 8}, {2}, {6}}                 C(P^2) = [illegible]
 3      P^3 = {{1, 3, 4, 5, 7}, {2}, {6}, {8}}               C(P^3) = 306239
 4      P^4 = {{1, 3, 4, 5}, {2}, {6}, {7}, {8}}             C(P^4) = 283959
 5      P^5 = {{1, 4, 5}, {2}, {3, 8}, {6}, {7}}             C(P^5) = 271814
 6      no improvement

Figure 6 The result of the single attribute ungrouping heuristic.


2. The E 




In the example presented above, we see that the trivial partition P^0 in Figure 4 has
a surprisingly low performance cost when compared with the one-file partition of Figure 3.
In fact, the trivial partition in most examples turns out to be relatively near optimal, and for
[...] of the trivial partition can be less than [...] percent of the performance cost of the
one-file partition. In the course of our experimentation, it was also observed that the relative
optimality of the trivial partition increased as the number of the attributes in the relation
increased. We offer the following explanations for the relative optimality of the trivial
partition:


(a) Most queries access only a few of the attributes. For every query that accesses a group
of attributes, there are usually other queries that access a subset of this group of
attributes (plus possibly other attributes) and hence cause the group to break apart (i.e.,
the subset of attributes becomes less attractive to the rest of the group). For example, if
one query accesses attributes 1 and 2 and another query accesses attributes 2 and 3,
and both have the same frequency, then (depending on the selectivities of the queries) it
is most likely that a trivial partition of the attributes 1, 2, and 3 is more cost effective
than a one-file or a two-subfile partition of them.


(b) In order that two attributes be grouped together in the same subfile, they have to be
accessed in the same query. The more selective the query (i.e., the smaller the number of
tuples that satisfy the selection predicate), the greater the advantage of grouping the
attributes. If the query is very selective, grouping the attributes can reduce the number
of page accesses to half the page accesses that would be encountered if the attributes
were not grouped; therefore the attributes used in a selective query are very attractive
to one another. Conversely, if a query is rather unselective, the relative advantage
gained by grouping the attributes is slight. For example, if a query is so unselective that
it requires all the pages of the subfile, then grouping all the attributes in one subfile will
incur no more page accesses than separating them into several subfiles (which include
no other attributes). Therefore, although grouping attributes becomes more desirable
[...] attributes which has been induced by the selective queries. On the other hand, selective
queries incur fewer page accesses than unselective queries, and the performance cost of a
partition is determined to a great extent by unselective queries. Thus for a usage
pattern where selective and unselective queries are represented with the same frequency,
the optimal partition will be closer to the trivial partition in order to reflect the
dominance of the unselective queries.

(c) An attribute that is resolved by sequential searching or by linking from a TID list
which contains a large number of TIDs (when compared to the total number of tuples
in the relation) will manifest a smaller attractivity to the other attributes of the file. In
contrast, an attribute that is indexed will have a relatively higher attractivity measure to
the other attributes of the file. This is because when an attribute is resolved by
sequential searching or by linking from a large TID list, all the values for that attribute
will have to be retrieved in their entirety. Any attribute that may co-exist with this
attribute in the same subfile will also have to be retrieved in its entirety, even if this
attribute is not requested in the query that requests the unindexed attribute. Hence (as
we have noted in Section 5.7), providing indices on attributes will contribute to the
clustering of the attributes. In the experiments we have conducted (and also in most
real databases), only a few of the file's attributes were indexed and most attributes did
not have an index. Therefore for such cases, the trivial partition will be relatively
optimal, since the attributes that are not indexed will be unattractive or show little
attractivity to the rest of the attributes of the file, and the optimal partition will be
composed of a large number of subfiles, each subfile containing only a few attributes.
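Explanation (b) can be checked numerically with a small sketch. The page-access estimate used here is a standard approximation for the expected number of pages touched when k tuples are fetched at random; it is an assumption of this illustration and not the file cost estimator of Section 4.4, and it ignores the cost of linking between subfiles. The relation size and page size follow Figure 1; the two attribute widths of 10 words are illustrative.

```python
import math

N_TUPLES = 100_000      # n in Figure 1
PAGE_SIZE = 1024        # S in Figure 1, in words


def subfile_pages(width):
    """Pages needed to store N_TUPLES subtuples of the given width."""
    tuples_per_page = PAGE_SIZE // width
    return math.ceil(N_TUPLES / tuples_per_page)


def pages_touched(n_pages, k):
    """Expected pages touched when k tuples are fetched at random."""
    return n_pages * (1.0 - (1.0 - 1.0 / n_pages) ** k)


def query_cost(block_widths, k):
    """Page accesses for a query that needs every listed subfile."""
    return sum(pages_touched(subfile_pages(w), k) for w in block_widths)


def grouping_ratio(k, w1=10, w2=10):
    """Cost with the two attributes grouped, relative to storing them apart."""
    return query_cost([w1 + w2], k) / query_cost([w1, w2], k)


selective = grouping_ratio(k=10)            # query retrieves 10 tuples
unselective = grouping_ratio(k=N_TUPLES)    # query retrieves every tuple
print(round(selective, 3), round(unselective, 3))
```

Under these assumptions the grouped layout costs about half as much as the separate layout for the selective query, and essentially the same for the unselective one, which matches the argument made in (b).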


The merit of the trivial partition has not been overlooked in practical database
design. A variant of the trivial partition in a non-flat file implementation of a database is a
file organization where the main file consists of records of pointers, in which each pointer
points to the actual attribute value stored in a secondary data area [25]. The secondary data
area is separate from the main file, and values of different attributes are stored separately. In
this manner, the attributes are separated from one another into their own exclusive areas in a
way that resembles a trivial partition. Accessing one attribute will therefore not bring in any
values of other attributes (except for the pointers which are relatively short), but will cause
one extra page access since the pointer has to be followed to the actual data. In such file
organizations, data compaction techniques (like eliminating redundant copies of the same
attribute value) become possible, and are often employed to reduce storage (and also access)
requirements (but at the expense of computing time).


CHAPTER 7


REMARKS AND FUTURE DIRECTIONS


In this report we have outlined a self-adaptive relational database management
system that performs attribute partitioning for single relation databases. We have also
developed a number of computationally feasible attribute partitioning heuristics that select a
near optimal partition for the attributes, with the goal of optimizing on the paging
performance of the database. In the next section, we provide suggestions for extending the
underlying environment in order to solve more realistic attribute partitioning problems. We
conclude this report with suggestions for extending the scope of attribute partitioning to
wider problems, along with a discussion of the relationship between database attribute
partitioning and other physical database design issues.


There are numerous issues and parameters that have to be considered when
optimizing the physical design of a database in a complex environment. In this report we
have addressed the attribute partitioning problem in the context of a model of the database
management system that incorporates a number of these parameters that we feel are more
important and whose consideration is required in order to have a meaningful model of
practical database usage. Two of these parameters are attribute selectivity and the blocking
of tuples into discrete pages. However, it is possible to incorporate into our model of the
database management system a number of other parameters without greatly increasing the
complexity of the attribute partitioning problem, and without introducing the need for
revamping the heuristics we have shown to be appropriate in our current model. In the
extensions to our current model suggested below, we believe that the pairwise grouping
heuristic will still be the most viable heuristic method (its near optimality could still be
verified by a course of experimentation with the heuristic, showing that the partition produced
by the pairwise grouping heuristic is near the optimal partition found for that problem).
However, it is possible that in an extended model the attractivity of a pair of blocks will not
be the same from one step of the pairwise grouping heuristic to the next step (i.e., violating
condition 5.5.4), thus precluding the application of the fast pairwise grouping heuristic.
One parameter that would be desirable to include in our model is the overhead cost
of accessing a subfile. When accessing a subfile in a database, there is usually a degree of
overhead associated with accessing each subfile. The overhead may be incurred in opening
and closing the subfile, initializing the page map table for the subfile, or allocating buffer
areas in primary memory for the subfile. Overhead charges can be incorporated in our
model by having the file cost estimator add a fixed overhead cost
whenever a subfile is accessed by the method of a query.
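As a minimal sketch of this extension, assuming the overhead is expressed in the same unit as page accesses, the per-query cost could be computed as follows. The function name, the dictionary layout and the numbers are hypothetical and are not part of the file cost estimator described in this report.

```python
def query_cost_with_overhead(pages_per_subfile, overhead):
    """Page accesses of one query plus a fixed charge per subfile it touches.

    pages_per_subfile maps a subfile name to the page accesses the query
    makes to it (zero if the query does not touch that subfile).
    """
    touched = [pages for pages in pages_per_subfile.values() if pages > 0]
    return sum(touched) + overhead * len(touched)


# A query that reads two of three subfiles, with an overhead worth 3 page accesses.
print(query_cost_with_overhead({"s1": 40, "s2": 0, "s3": 12}, overhead=3))
```

A charge of this kind penalizes partitions in which a query must touch many subfiles, so it tends to pull the preferred partition away from the trivial partition.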


One other way we may extend our current model is to differentiate between the cost
of accessing different subfiles and to differentiate between the storage cost for different
subfiles. It is possible to store subfiles on a variety of storage devices with differing storage
and access costs, in order to further reduce the overall cost of the database. For example, a
subfile that is not frequently accessed and that requires a significant amount of storage may
be stored on a secondary storage device, other than a disk, that has higher access cost but
lower storage cost. Differentiation between the costs of subfiles has been the main concern of
some previous studies that have viewed the attribute partitioning problem in terms of
allocating attributes to primary and secondary subfiles (Benner [4], Hoffer [18], Eisner and
Severance [14], and March and Severance [23]). We may incorporate differentiated subfiles in
our model by making the file cost estimator multiply the number of page accesses to a subfile
by a factor that reflects the relative cost of accessing the pages of the subfile with respect to
the other subfiles. Differentiated storage costs for subfiles may also be included in our model
by adding up the storage costs for each subfile depending on the kind of device the subfile is
stored on during the time interval between repartitioning points.
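The weighting described above can be sketched as follows. The `Subfile` fields, the device weights and the storage rates are hypothetical values chosen for illustration; the report does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Subfile:
    pages: int             # size of the subfile in pages
    accesses: float        # page accesses charged to it between repartitioning points
    access_weight: float   # relative cost of one page access on its device
    storage_rate: float    # storage cost per page over the same interval


def total_cost(subfiles):
    """Weighted access cost plus storage cost, summed over all subfiles."""
    access = sum(s.accesses * s.access_weight for s in subfiles)
    storage = sum(s.pages * s.storage_rate for s in subfiles)
    return access + storage


hot = Subfile(pages=200, accesses=5000, access_weight=1.0, storage_rate=0.5)
cold_on_disk = Subfile(pages=3000, accesses=100, access_weight=1.0, storage_rate=0.5)
cold_on_slow_device = Subfile(pages=3000, accesses=100, access_weight=4.0, storage_rate=0.125)

print(total_cost([hot, cold_on_disk]), total_cost([hot, cold_on_slow_device]))
```

With these illustrative numbers, moving the large, rarely accessed subfile to the slower and cheaper device lowers the total cost, which is the situation the paragraph above describes.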


It would be desirable to have the system automatically and optimally determine the
repartitioning points. Computing the repartitioning points should be based upon the
consideration that too frequent invocation of the partitioning heuristics will result in
expending a significant amount of computational resources on the search for a marginally
better attribute partition, while infrequent invocation of the heuristics may result in degraded
database performance in the intervening time between repartitioning points. Shneiderman
[33] and Yao et al. [38] have investigated the problem of optimally determining
reorganization points for the purpose of eliminating overflows in files that are due to tuple
insertion and deletion. They also present a number of techniques for this purpose. We feel
that their techniques could be applied to the problem of determining the repartitioning
points.


Even though our investigation of attribute partitioning is in many respects more
comprehensive than previous studies, we have only considered the attribute partitioning
problem within the context of a single relation database environment, and where the relation
is accessed through a restricted interface for the selection of data. To
fully realize the flexibility of a relational database, it is necessary to consider a multi-relation
environment together with a high-level nonprocedural language interface that permits queries
with arbitrary join operations between relations in their selection parts. Also, it is
necessary to consider queries that have arbitrary boolean expressions (i.e., any combination of
conjunctions and disjunctions) in their selection part, and in which the predicates of the
boolean expression contain other comparison operators besides equality conditions. (Arbitrary
boolean expressions in queries have been disallowed in our model mainly because of the
complications that arise in the evaluation of such queries for a partitioned database. Also,
the consideration of comparison operators other than equality conditions makes the problem
of estimating the number of tuples that satisfy a query considerably more complex, though
not infeasible.) To our knowledge, the attribute partitioning problem for a multi-relation
database with such general queries has not yet been addressed.


We have assumed a flat file physical implementation of a relation in our attribute
partitioning system. In most practical databases, a nonflat file implementation of data is
assumed for two reasons. 1. For the purpose of data compaction, in order to reduce the
amount of storage required to store the data. Data compaction may be accomplished by
replacing attribute values with pointers that point to a common but smaller pool of the actual
attribute values (data encoding), by using pointers that point to the end of a variable length
attribute value (so that the maximum storage does not have to be allocated for each of
the variable length attributes), or by factoring out the same attribute value from any
number of tuples that contain the attribute value, in order to eliminate its repetition in more
than one tuple (as in hierarchical organizations). 2. For the purpose of access enhancement.
Access enhancement in a nonflat implementation may be accomplished by the elimination of
join operations (i.e., searches) on subfiles by having tuples that satisfy a frequent join
operation explicitly linked to one another by pointers. We believe that the development of
attribute partitioning heuristics for nonflat file implementations will not be an easy task; this
difficulty is compounded by the complexity of query evaluation in such file organizations and
by the problems of developing an accurate cost model for such a database management
environment. Babad [2] and Benner [4] have addressed a limited form of the attribute
partitioning problem for a nonflat file implementation of a relation. Specifically, they consider
the problem of attribute partitioning for file organizations that permit attribute values with
variable lengths, and where not all subtuples are of equal length. Schkolnick [31] considers a
hierarchical file organization where [...]. Page access is
minimized by partitioning the nodes of the hierarchical file according to the frequency of
attribute requests by queries and according to the position of the attribute within the
hierarchy.

We have confined our attention to problems of attribute partitioning. Attribute
clustering, as the logical extension of attribute partitioning, can result in considerable access
(and storage) cost reduction over simple partitioning. As an example where attribute
clustering may result in access cost reduction, consider attributes a1, a2, and a3, where a1 is
highly attractive to both a2 and a3, but where a2 and a3 are highly unattractive. In
such a situation it is desirable to redundantly store a1 in two subfiles: in one subfile with a2
and in the other subfile with a3. Furthermore, in some cases, attribute clustering can
paradoxically result in the reduction of storage costs: if a set of attributes of a relation is
reproduced in a subfile of the attribute cluster, such that all the other attributes of the subfile
are functionally dependent on these reproduced attributes, then it is possible to replace
linking as an access path to subtuples of the subfile by an equi-join on the reproduced set of
attributes. In other words, for a given subtuple, in order to find its co-subtuple in a subfile,
the equi-join of subtuples in the subfile is taken with the given subtuple on the set of
reproduced attributes. Since the set of reproduced attributes constitutes a key for the subfile,
the result of the equi-join is a single subtuple which is the co-subtuple of the given subtuple.
The consequence of eliminating the links is that the subfile containing the reproduced copy
of the attributes may be compacted by eliminating all identical subtuples; in certain cases this
will result in considerable recovery of spare storage as the subtuples of the subfile become
unique. But there are more direct techniques for saving storage (such as data encoding) that
are better for this purpose than attribute clustering. Therefore attribute clustering should be
primarily considered for the purpose of access cost minimization. Omnan [29] considers a
special case of attribute clustering in a single relation database environment where the
primary key of the relation is reproduced in each of the subfiles. The original relation may
then be reconstructed from the equi-join of the collection of the subfiles on the primary key.
This model may be construed as a variant of attribute partitioning where the linking access
path is replaced by the equi-join operation, since, by reproducing the primary key all
subtuples become unique, each subfile has an equal number of subtuples, and a subtuple has a
unique co-subtuple in each of the other subfiles. Hoffer [18] also addresses the problem of
attribute clustering in a flat file implementation of a relation and constructs an integer
programming formulation of the problem, together with a branch and bound algorithm for its
solution.
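The special case in which the primary key is reproduced in every subfile can be sketched in a few lines. This is an illustration only, assuming in-memory subfiles represented as lists of dictionaries; the relation, the attribute names and the helper functions are hypothetical.

```python
def split_with_key(relation, key, groups):
    """Build one subfile per attribute group, reproducing `key` in each."""
    return [[{a: row[a] for a in (key, *group)} for row in relation]
            for group in groups]


def equi_join_on_key(subfiles, key):
    """Rebuild the relation by equi-joining all subfiles on the reproduced key."""
    joined = {}
    for subfile in subfiles:
        for subtuple in subfile:
            # Subtuples sharing a key value are co-subtuples of one original tuple.
            joined.setdefault(subtuple[key], {}).update(subtuple)
    return list(joined.values())


relation = [
    {"id": 1, "name": "ann", "dept": "ee", "salary": 10},
    {"id": 2, "name": "bob", "dept": "cs", "salary": 12},
]
subfiles = split_with_key(relation, "id", [("name",), ("dept", "salary")])
rebuilt = equi_join_on_key(subfiles, "id")
print(rebuilt == relation)
```

Because the key is unique within each subfile, every key value selects exactly one subtuple per subfile, so no explicit links between co-subtuples are needed.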


A problem related to the attribute partitioning and clustering problems is their dual:
tuple partitioning (or horizontal partitioning, where the file is partitioned horizontally by
tuples). It is well known that tuples are not requested by queries with uniform frequency: in
actual applications it has been frequently observed that the tuple request distribution follows
the "80-20" rule of thumb [17]. This rule states that 80% of the queries deal with the most
active 20% of a file. Therefore considerable access enhancement may be accomplished by
clustering the frequently accessed tuples together, separate from the infrequently accessed
tuples. In its simplest form, tuple partitioning is accomplished in a flat file implementation by
grouping tuples that are accessed together into horizontal subfiles. The reason such a subfile
is called a horizontal subfile is because the table representing the file is being partitioned
horizontally by its rows. (Similarly, a subfile of an attribute partition may be viewed as a
vertical subfile.) In tuple partitioning, in order that the access cost of two similarly accessed
tuples be reduced, the tuples have to be placed in the same page. Tuple partitioning may
thus be viewed as shuffling the tuples among the pages of the file, placing similarly accessed
tuples in the same page, so that total access cost is minimized. We feel that the attribute
partitioning heuristics presented in this report would be computationally infeasible if ever
applied to the tuple partitioning problem because of the sheer number of tuples involved. In
practical databases, the number of tuples in the database is frequently as large as 10^[...]. For
the same reason, it appears rather unlikely that a tuple partitioning heuristic that
considers each individual tuple can be developed and successfully applied to the tuple
partitioning problem. Another source of difficulty in the tuple partitioning problem is that a
subfile in a tuple partition is limited in capacity to the size of a page. The heuristics we
have developed do not consider such constraints.

Knuth [22] and Rivest [30] each describe a heuristic for arranging the tuples of a file
according to their access frequencies. In the heuristic of Knuth, each tuple that is retrieved
by a query is relocated to the top of the file such that the tuple becomes the first tuple to be
searched by the next query. In the heuristic of Rivest, a tuple that is accessed is exchanged
with the immediately preceding tuple. Both heuristics are rather limited in scope since they
assume no blocking of tuples into pages and they assume that the only way that the file is
searched is by sequential search. Both heuristics also have the drawback that they are very
sensitive to the order the queries are made to the database, and the resulting organization of
the database can drastically change if the queries arrive in a slightly different order.
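The two rules described above are commonly known as move-to-front and transposition. The following minimal sketch treats the file as an in-memory list searched sequentially; the function names and the sample request sequences are illustrative.

```python
def move_to_front(file, key):
    """Relocate the accessed tuple to the top of the file."""
    i = file.index(key)
    file.insert(0, file.pop(i))
    return i + 1                      # tuples examined by the sequential search


def transpose(file, key):
    """Exchange the accessed tuple with the tuple immediately preceding it."""
    i = file.index(key)
    if i > 0:
        file[i - 1], file[i] = file[i], file[i - 1]
    return i + 1


def run(rule, initial, requests):
    """Apply `rule` to each request in turn; return the final order and total cost."""
    file = list(initial)
    cost = sum(rule(file, key) for key in requests)
    return file, cost


print(run(move_to_front, "abcd", "dda"))
print(run(move_to_front, "abcd", "add"))   # same requests, different arrival order
print(run(transpose, "abcd", "dda"))
```

The first two calls process the same three requests in a different order and leave the file in different arrangements, which illustrates the sensitivity to arrival order noted above.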
We believe that the tuple partitioning problem should be solved using cluster
analysis techniques that consider both the attribute values in the tuple and the occurrence of
attribute values in queries. Statistics gathered for the purpose of tuple partitioning must
record not only the attributes requested by the queries made to the database, but also the
attribute values in the equality condition predicates of the query, so that similarly accessed
tuples may be identified.

A slight generalization of the tuple partitioning problem is to the case where tuples
are stored in pages that need not be completely filled, but may be partly empty. This
variation of tuple partitioning is even more difficult to solve, but may result in enhanced
access performance at the cost of an increase in storage requirement. Tuple partitioning may
be further generalized into tuple clustering where tuples may be redundantly stored in
several pages, in order to reduce page accesses (while increasing storage requirements).


A further generalization of the partitioning problem is the hybrid clustering
(partitioning) problem where attribute clustering (partitioning) and tuple clustering
(partitioning) are carried out simultaneously. In this problem, a file is partitioned both by its
attributes and by its tuples such that each subfile has a subset of the attributes and a subset
of the tuples. The subfiles of a hybrid partition need not all be of the same size either in the
number of attributes or in the number of tuples. One way to picture a flat file partitioned in
a hybrid manner is as a composition of rectangular mosaics, with varying lengths and widths,
placed adjacent to one another such that the whole file is covered. The hybrid data clustering
(partitioning) problem is much larger than either the attribute clustering or the tuple
clustering problems. For this reason, its solution requires more powerful heuristics than any
we have considered. A computationally feasible, yet not necessarily optimal or near optimal,
approach would be to perform attribute clustering (partitioning) and tuple clustering
(partitioning) alternately, in order to reach a hybrid cluster that has a locally minimum
performance cost.
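The alternating approach can be sketched generically as a two-variable stepwise minimization. In this sketch, the candidate sets and the cost table are hypothetical: in a real setting each inner minimization would be a run of an attribute partitioning heuristic or of a tuple partitioning heuristic with the other choice held fixed.

```python
def alternate(cost, x, y, x_choices, y_choices):
    """Alternately re-optimize x with y fixed, then y with x fixed, until no gain.

    The starting values x and y must belong to x_choices and y_choices.
    """
    best = cost(x, y)
    while True:
        x = min(x_choices, key=lambda c: cost(c, y))   # e.g. attribute partitioning step
        y = min(y_choices, key=lambda c: cost(x, c))   # e.g. tuple partitioning step
        new = cost(x, y)
        if new >= best:                                # no further improvement
            return x, y, new
        best = new


# Hypothetical cost table over two attribute partitions and two tuple partitions.
COST = {("A1", "T1"): 10, ("A1", "T2"): 12, ("A2", "T1"): 11, ("A2", "T2"): 7}
cost = lambda a, t: COST[(a, t)]

print(alternate(cost, "A1", "T1", ["A1", "A2"], ["T1", "T2"]))
print(alternate(cost, "A1", "T2", ["A1", "A2"], ["T1", "T2"]))
```

With this table, the first starting point stops at a cost of 10 although a combination with cost 7 exists, while the second starting point reaches it. This shows why the result of alternation is only guaranteed to be locally minimal.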


Another direction of extending attribute partitioning is to consider attribute
partitioning simultaneously with selecting other file access structures, where the choice of the
access structure depends on the choice of the attribute partition, and vice versa. For example,
we may consider choosing indices for the attributes of a partitioned file. In our current
model, we have assumed the set of indexed attributes to be predetermined, and
consequently the file is partitioned based on a fixed set of indices. If the assumption of fixed
indices is removed, the overall performance of the database management system might
improve when attribute partitioning and index selection are considered simultaneously. One
plausible strategy here is to alternately perform attribute partitioning and index selection in a
stepwise minimization fashion. The heuristic presented by [...] for the selection
of indices is particularly suited for such a strategy.